## Data Preprocessing

### Importing Required Libraries
In this section, we import pandas for data manipulation. NumPy and the scikit-learn utilities are imported in later cells, where they are first needed.

In [36]:
import pandas as pd

### Loading the Dataset


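We load the dataset into a pandas DataFrame, check its shape, and count the missing values in each column. A notebook cell displays only its last expression, so the output below shows the missing-value counts.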

In [37]:
df = pd.read_csv('country_dataset.csv')
df.shape
df.isnull().sum()

Country     0
Age         3
Salary      3
Purchase    0
dtype: int64
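
The inspection steps above can be sketched on a small stand-in DataFrame. The rows and the `Yes`/`No` labels below are hypothetical and are not taken from `country_dataset.csv`; only the column names match the notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for country_dataset.csv; values are illustrative only.
sample = pd.DataFrame({
    "Country": ["India", "France", "Germany", "India"],
    "Age": [38.0, np.nan, 43.0, 35.0],
    "Salary": [48000.0, 45000.0, np.nan, 58000.0],
    "Purchase": ["No", "Yes", "No", "Yes"],
})

first_rows = sample.head(3)                 # first three rows
n_rows, n_cols = sample.shape               # (rows, columns)
missing_per_column = sample.isnull().sum()  # NaN count per column

print(first_rows)
print(n_rows, n_cols)
print(missing_per_column)
```

Wrapping each expression in `print` (or placing it in its own cell) makes every result visible, not just the last one.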

### Handling Missing Values


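We use `SimpleImputer` from scikit-learn to fill the missing values in `Age` and `Salary`. If scikit-learn is not installed, run the command in the cell below, which is commented out here.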

In [38]:
# pip install scikit-learn

### Encoding Categorical Data


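A first attempt at mean imputation on the whole DataFrame fails because it still contains string columns such as `Country`. Categorical features therefore have to be converted to numbers first, which we do with `LabelEncoder`.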

In [39]:
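import numpy as np
from sklearn.impute import SimpleImputer

# Mean imputation only works on numeric columns; with the string column
# 'Country' still present, fit_transform raises the ValueError shown below.
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
df = imputer.fit_transform(df)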

ValueError: Cannot use mean strategy with non-numeric data:
could not convert string to float: 'India'
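
One way to avoid this error, different from the encode-first route this notebook takes, is to apply the imputer only to the numeric columns. The sketch below uses a hypothetical three-row frame; the names `toy`, `numeric_cols`, and `numeric_imputer` are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical data: one string column and two numeric columns with gaps.
toy = pd.DataFrame({
    "Country": ["India", "France", "Germany"],
    "Age": [30.0, np.nan, 50.0],
    "Salary": [45000.0, 55000.0, np.nan],
})

numeric_cols = ["Age", "Salary"]
numeric_imputer = SimpleImputer(missing_values=np.nan, strategy="mean")

# Impute only the numeric columns; 'Country' is never passed to the imputer.
toy[numeric_cols] = numeric_imputer.fit_transform(toy[numeric_cols])
print(toy)
```

The string column is left untouched, so it can still be encoded afterwards with whichever encoder suits the model.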

In [40]:
df['Country'].unique()

array(['India', 'France', 'Germany'], dtype=object)

In [41]:
from sklearn.preprocessing import LabelEncoder
label_encoder = LabelEncoder()
df_encoded = df.copy()  # copy, so that encoding does not overwrite the original df
df_encoded['Country'] = label_encoder.fit_transform(df['Country'])

In [42]:
df_encoded['Country'].unique()

array([2, 0, 1])

In [43]:
df_encoded['Purchase'] = label_encoder.fit_transform(df['Purchase'])
df_encoded['Purchase'].unique()

array([0, 1])
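
`LabelEncoder` assigns codes in sorted label order, which is why `India` maps to 2 above. The sketch below shows how the mapping can be inspected through `classes_` and reversed with `inverse_transform`. It uses a separate encoder named `country_encoder` (an illustrative name) because refitting one shared encoder on another column discards the earlier mapping.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample of the Country column.
countries = pd.Series(["India", "France", "Germany", "India"])

country_encoder = LabelEncoder()
codes = country_encoder.fit_transform(countries)

# classes_ is sorted; each label's code is its position in that array.
mapping = {label: code for code, label in enumerate(country_encoder.classes_)}

# inverse_transform recovers the original labels from the integer codes.
decoded = country_encoder.inverse_transform(codes)

print(mapping)
print(codes)
print(decoded)
```

Note that these integer codes imply an ordering (France < Germany < India) that has no real meaning. For models sensitive to that, one-hot encoding is a common alternative.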

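### Imputing Missing Values After Encoding

With every column now numeric, we apply the mean imputer again. Note that `fit_transform` returns a NumPy array rather than a DataFrame.
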
In [44]:
df_encoded = imputer.fit_transform(df_encoded)

In [45]:
df_encoded

array([[2.00000000e+00, 3.80000000e+01, 4.80000000e+04, 0.00000000e+00],
       [2.00000000e+00, 3.80000000e+01, 4.80000000e+04, 1.00000000e+00],
       [2.00000000e+00, 3.80000000e+01, 4.80000000e+04, 0.00000000e+00],
       [2.00000000e+00, 3.50000000e+01, 5.80000000e+04, 0.00000000e+00],
       [2.00000000e+00, 3.50000000e+01, 5.80000000e+04, 1.00000000e+00],
       [2.00000000e+00, 3.50000000e+01, 5.80000000e+04, 0.00000000e+00],
       [2.00000000e+00, 5.00000000e+01, 8.80000000e+04, 0.00000000e+00],
       [2.00000000e+00, 5.00000000e+01, 8.80000000e+04, 1.00000000e+00],
       [2.00000000e+00, 5.00000000e+01, 8.80000000e+04, 0.00000000e+00],
       [0.00000000e+00, 4.30000000e+01, 4.50000000e+04, 1.00000000e+00],
       [0.00000000e+00, 4.30000000e+01, 4.50000000e+04, 0.00000000e+00],
       [0.00000000e+00, 4.30000000e+01, 4.50000000e+04, 1.00000000e+00],
       [0.00000000e+00, 4.30000000e+01, 4.50000000e+04, 1.00000000e+00],
       [0.00000000e+00, 4.80000000e+01, 6.50000000e

In [46]:
df_encoded = pd.DataFrame(df_encoded)
df_encoded.isnull().sum()

0    0
1    0
2    0
3    0
dtype: int64
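
Wrapping the array in `pd.DataFrame` without arguments replaces the column names with the integers 0 to 3, as the output above shows. A minimal sketch of keeping the original names, using a hypothetical already-encoded frame called `encoded`:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical encoded data with the same column names as the notebook.
encoded = pd.DataFrame({
    "Country": [2, 0, 1],
    "Age": [38.0, np.nan, 43.0],
    "Salary": [48000.0, 45000.0, np.nan],
    "Purchase": [0, 1, 0],
})

imputed_array = SimpleImputer(strategy="mean").fit_transform(encoded)

# Reattach the original column names and index to the returned array.
imputed = pd.DataFrame(imputed_array, columns=encoded.columns, index=encoded.index)
print(imputed)
```

Keeping the names makes later steps such as selecting features easier to read than positional slicing.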

In [47]:
df_encoded.describe()

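|       | 0        | 1         | 2            | 3        |
|-------|----------|-----------|--------------|----------|
| count | 34.0     | 34.0      | 34.0         | 34.0     |
| mean  | 0.794118 | 41.516129 | 63451.612903 | 0.529412 |
| std   | 0.844928 | 6.198098  | 13796.878676 | 0.50664  |
| min   | 0.0      | 30.0      | 45000.0      | 0.0      |
| 25%   | 0.0      | 37.0      | 53000.0      | 0.0      |
| 50%   | 1.0      | 41.516129 | 63451.612903 | 1.0      |
| 75%   | 1.75     | 48.0      | 77000.0      | 1.0      |
| max   | 2.0      | 50.0      | 88000.0      | 1.0      |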


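### Separating the Independent (x) and Dependent (y) Variables

The first three columns (Country, Age, Salary) form the features `x`, and the last column (Purchase) is the target `y`.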

In [48]:
x = df_encoded.iloc[:, 0:3]

In [49]:
y = df_encoded.iloc[:,-1]

In [50]:
y.shape,x.shape

((34,), (34, 3))

### Normalizing Features with MinMaxScaler

We use `MinMaxScaler` from `sklearn.preprocessing` to rescale each feature in `x` to the range 0 to 1, so that all features share a consistent scale.

In [51]:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
x_scale = pd.DataFrame(scaler.fit_transform(x))
x_scale.head()

|   | 0   | 1    | 2        |
|---|-----|------|----------|
| 0 | 1.0 | 0.4  | 0.069767 |
| 1 | 1.0 | 0.4  | 0.069767 |
| 2 | 1.0 | 0.4  | 0.069767 |
| 3 | 1.0 | 0.25 | 0.302326 |
| 4 | 1.0 | 0.25 | 0.302326 |
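
Min-max scaling maps each value to `(value - min) / (max - min)` for its column. As a check, the first row of the output above can be reproduced by hand from the column minima and maxima reported by `describe()`. The helper `min_max_scale` is an illustrative function, not part of scikit-learn.

```python
def min_max_scale(value, col_min, col_max):
    """Rescale value to [0, 1] given the column's minimum and maximum."""
    return (value - col_min) / (col_max - col_min)

# First row of the imputed data: Country=2, Age=38, Salary=48000.
# Column minima and maxima are taken from the describe() output.
country_scaled = min_max_scale(2.0, 0.0, 2.0)
age_scaled = min_max_scale(38.0, 30.0, 50.0)
salary_scaled = min_max_scale(48000.0, 45000.0, 88000.0)

print(country_scaled, round(age_scaled, 6), round(salary_scaled, 6))
```

The three results agree with the values 1.0, 0.4, and 0.069767 in the first row of the scaled table.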


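### Standardizing Features with StandardScaler

We use `StandardScaler` from `sklearn.preprocessing` to standardize each feature in `x` to a mean of 0 and a standard deviation of 1. This shifts and rescales each feature but does not change the shape of its distribution, so the data does not become normally distributed.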

In [52]:
from sklearn.preprocessing import StandardScaler
standardScaler = StandardScaler()
pd.DataFrame(standardScaler.fit_transform(x)).head()

|   | 0        | 1         | 2         |
|---|----------|-----------|-----------|
| 0 | 1.448664 | -0.575823 | -1.136777 |
| 1 | 1.448664 | -0.575823 | -1.136777 |
| 2 | 1.448664 | -0.575823 | -1.136777 |
| 3 | 1.448664 | -1.067121 | -0.401076 |
| 4 | 1.448664 | -1.067121 | -0.401076 |
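
Standardization computes `(value - mean) / std` per column. `StandardScaler` uses the population standard deviation (dividing by n), whereas `describe()` reports the sample standard deviation (dividing by n - 1), so the reported value has to be converted before the Age entry in the first row can be reproduced. The helper `standardize` below is illustrative.

```python
import math

def standardize(value, mean, population_std):
    """Return the z-score of value for a column with the given mean and std."""
    return (value - mean) / population_std

n = 34                      # row count from describe()
age_mean = 41.516129        # mean of the Age column from describe()
age_sample_std = 6.198098   # describe() reports the sample std (ddof=1)

# Convert the sample std to the population std that StandardScaler uses.
age_population_std = age_sample_std * math.sqrt((n - 1) / n)

age_z = standardize(38.0, age_mean, age_population_std)
print(round(age_z, 4))
```

The result lands close to the -0.575823 shown for Age in the first row of the table; small differences come from the rounded inputs.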


### Splitting into Training and Test Sets

`train_test_split` splits the dataset into training and test sets. With `test_size=0.30`, 30% of the rows go to the test set and 70% are kept for training; for 34 rows this gives 11 test rows and 23 training rows. Setting `random_state=2` makes the split reproducible, so the same rows land in each set every time the code runs. Note that the unscaled `x` is split here, not the scaled versions computed above. The `.shape` attribute confirms the dimensions of the resulting sets.

In [53]:
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.30,random_state=2)
x_train.shape,x_test.shape,y_train.shape,y_test.shape

((23, 3), (11, 3), (23,), (11,))
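
A common way to combine scaling with this split, which the notebook itself does not do, is to fit the scaler on the training rows only and reuse its statistics on the test rows, so that no information from the test set influences the scaling. The sketch below uses randomly generated stand-in data; `features`, `target`, and the other variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in data with the same shape as x and y (34 rows, 3 features).
rng = np.random.default_rng(0)
features = rng.normal(loc=50.0, scale=10.0, size=(34, 3))
target = rng.integers(0, 2, size=34)

x_tr, x_te, y_tr, y_te = train_test_split(
    features, target, test_size=0.30, random_state=2
)

scaler = StandardScaler()
x_tr_scaled = scaler.fit_transform(x_tr)  # learn mean and std from training rows only
x_te_scaled = scaler.transform(x_te)      # apply the training statistics to test rows

print(x_tr_scaled.shape, x_te_scaled.shape)
```

The training features end up with mean 0 per column, while the test features are only approximately centered, which is expected.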
