### Perform various data preprocessing techniques like handling missing data and feature scaling.

#### Step 1: Import the Python libraries needed for data preprocessing.


In [180]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

#### Step 2: Load the automobile dataset into a pandas DataFrame.

In [218]:
DF=pd.read_csv("Automobile.csv")
DF.head()

Unnamed: 0,name,mpg,cylinders,displacement,horsepower,weight,acceleration,model_year,origin
0,chevrolet chevelle malibu,18.0,8.0,307.0,130.0,3504.0,12.0,70,usa
1,buick skylark 320,15.0,8.0,350.0,165.0,3693.0,11.5,70,usa
2,plymouth satellite,18.0,8.0,318.0,150.0,3436.0,11.0,70,usa
3,amc rebel sst,16.0,8.0,304.0,150.0,3433.0,12.0,70,usa
4,ford torino,17.0,,302.0,140.0,3449.0,10.5,70,usa


#### Step 3: Take a quick look at the data to understand its structure and identify any missing values or anomalies.

In [220]:
DF.info()
DF.shape

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 398 entries, 0 to 397
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   name          398 non-null    object 
 1   mpg           398 non-null    float64
 2   cylinders     395 non-null    float64
 3   displacement  395 non-null    float64
 4   horsepower    386 non-null    float64
 5   weight        396 non-null    float64
 6   acceleration  395 non-null    float64
 7   model_year    398 non-null    int64  
 8   origin        398 non-null    object 
dtypes: float64(6), int64(1), object(2)
memory usage: 28.1+ KB


(398, 9)

#### The method isnull() checks each element of the DataFrame (or Series) for a missing value (NaN or None).
It returns a DataFrame (or Series) of the same shape as the input, filled with Boolean values. Chaining .sum() then counts the True values in each column, which gives the number of missing values per column.
#### True: the value is null (NaN or None).
#### False: the value is not null.

In [222]:
DF.isnull().sum()


name             0
mpg              0
cylinders        3
displacement     3
horsepower      12
weight           2
acceleration     3
model_year       0
origin           0
dtype: int64
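
To judge whether a column has "a small percentage" of missing data, the counts can be turned into percentages. The following is a minimal sketch on a made-up frame (the `demo` values are illustrative and are not taken from Automobile.csv); it relies on the fact that the mean of a Boolean column is the fraction of True values.

```python
import numpy as np
import pandas as pd

# Small made-up frame standing in for the automobile data (illustrative values only).
demo = pd.DataFrame({
    "mpg": [18.0, 15.0, 18.0, 16.0],
    "cylinders": [8.0, np.nan, 8.0, 8.0],
    "horsepower": [130.0, np.nan, np.nan, 150.0],
})

missing_count = demo.isnull().sum()        # number of nulls per column
missing_pct = demo.isnull().mean() * 100   # share of nulls per column, in percent

# Put both views side by side for a quick overview.
summary = pd.DataFrame({"missing": missing_count, "percent": missing_pct})
print(summary)
```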

#### Step 4: Handle Missing Data
#### Option 1: If the dataset is large and only a small percentage of the data is missing, you can remove the rows with missing values using dropna(subset=[...], inplace=True). The subset argument limits the check to the listed columns.


In [226]:
DF.dropna(subset=["acceleration"],inplace=True)


In [228]:
DF.info()

<class 'pandas.core.frame.DataFrame'>
Index: 395 entries, 0 to 397
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   name          395 non-null    object 
 1   mpg           395 non-null    float64
 2   cylinders     392 non-null    float64
 3   displacement  392 non-null    float64
 4   horsepower    384 non-null    float64
 5   weight        393 non-null    float64
 6   acceleration  395 non-null    float64
 7   model_year    395 non-null    int64  
 8   origin        395 non-null    object 
dtypes: float64(6), int64(1), object(2)
memory usage: 30.9+ KB
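
Note that the index above still runs from 0 to 397 even though only 395 rows remain: dropna keeps the original row labels. The sketch below, on a made-up `demo` frame, shows the non-inplace form of dropna and an optional reset_index call to renumber the rows. Whether renumbering is wanted depends on the later steps, so treat it as one possible choice.

```python
import numpy as np
import pandas as pd

# Illustrative frame; the values are made up for this example.
demo = pd.DataFrame({
    "acceleration": [12.0, np.nan, 11.0, 10.5],
    "weight": [3504.0, 3693.0, np.nan, 3449.0],
})

# Drop only the rows where "acceleration" is missing; NaNs in other columns are kept.
cleaned = demo.dropna(subset=["acceleration"])
print(cleaned.index.tolist())  # the original labels survive, leaving a gap

# Optionally renumber the rows from 0 and discard the old labels.
cleaned = cleaned.reset_index(drop=True)
print(cleaned)
```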


#### Option 2: If removing data isn't ideal, you can impute the missing values with the mean, median, or most frequent value of the column, for example df[col] = df[col].fillna(df[col].mean()).

In [230]:
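# Assign the result back instead of calling fillna(..., inplace=True) on a selected column.
DF["cylinders"]=DF["cylinders"].fillna(DF["cylinders"].mean())
DF["horsepower"]=DF["horsepower"].fillna(DF["horsepower"].mean())
# model_year has no missing values, so this line leaves it unchanged.
DF["model_year"]=DF["model_year"].fillna(DF["model_year"].mean())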
DF.info()

<class 'pandas.core.frame.DataFrame'>
Index: 395 entries, 0 to 397
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   name          395 non-null    object 
 1   mpg           395 non-null    float64
 2   cylinders     395 non-null    float64
 3   displacement  392 non-null    float64
 4   horsepower    395 non-null    float64
 5   weight        393 non-null    float64
 6   acceleration  395 non-null    float64
 7   model_year    395 non-null    int64  
 8   origin        395 non-null    object 
dtypes: float64(6), int64(1), object(2)
memory usage: 30.9+ KB


Note: calling 'df[col].fillna(value, inplace=True)' on a selected column raises a FutureWarning in recent pandas versions. The behavior will change in pandas 3.0, where the intermediate object always behaves as a copy, so the in-place call will no longer modify the original DataFrame. Use 'df[col] = df[col].fillna(value)' or 'df.fillna({col: value}, inplace=True)' instead.
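
The step above mentions median and most-frequent imputation as alternatives to the mean. A minimal sketch of both, using the assignment form and a made-up `demo` frame (the outlier 400.0 is there only to show why the median can be the safer choice):

```python
import numpy as np
import pandas as pd

# Illustrative frame; the values are made up for this example.
demo = pd.DataFrame({
    "horsepower": [130.0, np.nan, 150.0, 150.0, 400.0],
    "cylinders": [8.0, 8.0, np.nan, 4.0, 8.0],
})

# Median imputation: less sensitive to outliers (such as 400.0) than the mean.
demo["horsepower"] = demo["horsepower"].fillna(demo["horsepower"].median())

# Most-frequent imputation: mode() returns a Series, so take its first entry.
demo["cylinders"] = demo["cylinders"].fillna(demo["cylinders"].mode()[0])

print(demo)
```

Most-frequent imputation tends to suit discrete columns such as cylinders, where a mean like 5.47 is not a value any car actually has.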

#### Step 5: Feature Scaling


<img src="https://i.postimg.cc/G21gMYnF/f.png" alt="Image Description" width="500">









#### Option 1: This method scales each feature to have a mean of 0 and a standard deviation of 1.
### StandardScaler()

In [232]:
c=["cylinders","displacement","model_year","mpg"]
SC1=StandardScaler()
DF[c]=SC1.fit_transform(DF[c])
DF.head()

Unnamed: 0,name,mpg,cylinders,displacement,horsepower,weight,acceleration,model_year,origin
0,chevrolet chevelle malibu,-0.701885,1.507356,1.08359,130.0,3504.0,12.0,-1.642521,usa
1,buick skylark 320,-1.085384,1.507356,1.495293,165.0,3693.0,11.5,-1.642521,usa
2,plymouth satellite,-0.701885,1.507356,1.188909,150.0,3436.0,11.0,-1.642521,usa
3,amc rebel sst,-0.957551,1.507356,1.054866,150.0,3433.0,12.0,-1.642521,usa
4,ford torino,-0.829718,5.253353e-16,1.035718,140.0,3449.0,10.5,-1.642521,usa
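
Standardization replaces each value x with z = (x - mean) / std, computed per column. The sketch below checks this by hand against StandardScaler on a few illustrative numbers; the array `x` is made up for the example. One detail worth knowing is that StandardScaler uses the population standard deviation (ddof=0), which is also NumPy's default.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# One illustrative feature column (values made up for this example).
x = np.array([[18.0], [15.0], [18.0], [16.0], [17.0]])

scaled = StandardScaler().fit_transform(x)

# Manual z-score; np.std defaults to the population standard deviation (ddof=0),
# which matches what StandardScaler computes.
manual = (x - x.mean(axis=0)) / x.std(axis=0)

print(np.allclose(scaled, manual))
print(scaled.mean(), scaled.std())  # approximately 0 and 1
```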


#### Option 2: This method scales the data to a fixed range, usually between 0 and 1.
### MinMaxScaler()

In [234]:
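# DF[c] was already standardized in Option 1. Min-max scaling is unaffected by a
# prior linear rescaling, so the result matches min-max scaling of the raw values.
c=["cylinders","displacement","model_year","mpg"]
SC2=MinMaxScaler()
DF[c]=SC2.fit_transform(DF[c])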
DF.head()

Unnamed: 0,name,mpg,cylinders,displacement,horsepower,weight,acceleration,model_year,origin
0,chevrolet chevelle malibu,0.239362,1.0,0.617571,130.0,3504.0,12.0,0.0,usa
1,buick skylark 320,0.159574,1.0,0.728682,165.0,3693.0,11.5,0.0,usa
2,plymouth satellite,0.239362,1.0,0.645995,150.0,3436.0,11.0,0.0,usa
3,amc rebel sst,0.18617,1.0,0.609819,150.0,3433.0,12.0,0.0,usa
4,ford torino,0.212766,0.490306,0.604651,140.0,3449.0,10.5,0.0,usa
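
Min-max scaling replaces each value x with (x - min) / (max - min), computed per column. The cell above runs on columns that Option 1 had already standardized; the sketch below, on a made-up array `x`, checks the formula by hand and also shows that a prior standardization does not change the min-max result.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# One illustrative feature column (values made up for this example).
x = np.array([[18.0], [15.0], [18.0], [16.0], [17.0]])

minmax = MinMaxScaler().fit_transform(x)

# Manual min-max scaling to the range [0, 1].
manual = (x - x.min(axis=0)) / (x.max(axis=0) - x.min(axis=0))
print(np.allclose(minmax, manual))

# Standardizing first and then min-max scaling gives the same values,
# because min-max scaling cancels any positive linear rescaling.
standardized = StandardScaler().fit_transform(x)
minmax_after_std = MinMaxScaler().fit_transform(standardized)
print(np.allclose(minmax, minmax_after_std))
```

In practice it is usually clearer to pick one scaler per column, or to scale separate copies of the data when comparing the two methods.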


#### Step 6: Separate the dataset into features (X) and target (y) variables. The target is usually the column you want to predict.

In [238]:
X=DF[["name","mpg","cylinders","displacement","horsepower","weight","model_year","origin"]]
Y=DF["acceleration"]
Y.head()


0    12.0
1    11.5
2    11.0
3    12.0
4    10.5
Name: acceleration, dtype: float64
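
Listing every feature column by hand works, but it has to be updated whenever the columns change. An alternative sketch, assuming the target is a single named column, is to drop the target and keep everything else. The names `demo`, `X_demo`, and `y_demo` are placeholders for this example.

```python
import pandas as pd

# Illustrative frame; the values are made up for this example.
demo = pd.DataFrame({
    "mpg": [18.0, 15.0, 18.0],
    "weight": [3504.0, 3693.0, 3436.0],
    "acceleration": [12.0, 11.5, 11.0],
})

target = "acceleration"
X_demo = demo.drop(columns=[target])  # every column except the target
y_demo = demo[target]                 # the target as a Series

print(X_demo.columns.tolist(), y_demo.name)
```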


#### Step 7: After preprocessing, save the cleaned and scaled dataset to a new CSV file.


In [240]:
final=pd.concat([X,Y],axis=1)
final.to_csv("PreAuto.csv",index=False)
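
A quick way to confirm the file was written as intended is to read it back and compare its shape and columns. The sketch below does this with a made-up frame and a temporary directory, so it leaves no file behind; the file name demo.csv is a placeholder.

```python
import os
import tempfile

import pandas as pd

# Illustrative frame; the values are made up for this example.
demo = pd.DataFrame({"mpg": [0.24, 0.16], "acceleration": [12.0, 11.5]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.csv")
    # index=False keeps the row labels out of the file, so no extra
    # unnamed column appears when the file is read again.
    demo.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

print(reloaded.columns.tolist())
print(reloaded.shape == demo.shape)
```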

In [215]:
# Lab-1 Activities

# Perform data preprocessing for Automobile.csv

# i. Delete the column horsepower since it has a few missing values

# ii. Impute the remaining missing values with the median

# iii. Apply min-max scaling and standardization to Automobile.csv and explain which feature scaling method makes more sense for this dataset.