# Data Preprocessing & Cleaning
* Importing and exploring datasets.
* Handling missing values and outliers.
* Data normalization and scaling.
* Feature selection and engineering.
* Hands-on: Cleaning a real-world dataset.


# 1. Importing and Exploring Datasets
Before working with data, we need to import it and understand its structure. This involves checking data types, missing values, basic statistics, and data distribution.

In [None]:
import pandas as pd

df = pd.read_csv("./Data/projectile_motion_data.csv") # importing the data
df

Unnamed: 0,Initial Velocity (m/s),Launch Angle (degrees),Gravity (m/s²),Time of Flight (s),Maximum Height (m),Range (m),Final Velocity (m/s)
0,35.704696,29.621646,9.81,3.597915,15.873795,111.673470,35.704696
1,41.256605,16.854316,9.81,2.438717,7.292927,96.291353,41.256605
2,35.214240,45.086542,9.81,5.084161,31.696959,126.405404,35.214240
3,24.008337,74.626229,9.81,4.719520,27.313328,30.039574,24.008337
4,10.615840,53.304853,9.81,1.735384,3.692923,11.008535,10.615840
...,...,...,...,...,...,...,...
995,20.817931,37.937856,9.81,2.609378,8.349355,42.842447,20.817931
996,18.229606,47.936391,9.81,2.759162,9.335407,33.697695,18.229606
997,46.602734,64.246195,9.81,8.557320,89.795505,173.278245,46.602734
998,20.798318,77.749996,9.81,4.143683,21.054842,18.285823,20.798318


In [11]:
print(f"Rows: {df.shape[0]}, Columns: {df.shape[1]}")

Rows: 1000, Columns: 7


In [12]:
print(df.isnull().sum()) # Checking for missing values

Initial Velocity (m/s)    0
Launch Angle (degrees)    0
Gravity (m/s²)            0
Time of Flight (s)        0
Maximum Height (m)        0
Range (m)                 0
Final Velocity (m/s)      0
dtype: int64


In [8]:
df.describe() # Descriptive statistics

Unnamed: 0,Initial Velocity (m/s),Launch Angle (degrees),Gravity (m/s²),Time of Flight (s),Maximum Height (m),Range (m),Final Velocity (m/s)
count,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0
mean,27.65994,43.913658,9.81,3.722127,23.859054,74.425397,27.65994
std,13.009542,19.756287,6.398084e-14,2.368185,26.813351,63.995819,13.009542
min,5.040648,10.005071,9.81,0.261268,0.083705,1.243978,5.040648
25%,16.737346,26.898061,9.81,1.726966,3.65719,20.056532,16.737346
50%,27.209961,43.226227,9.81,3.131978,12.028655,55.244295,27.209961
75%,39.139619,60.571917,9.81,5.351349,35.116042,117.217496,39.139619
max,49.998381,79.99518,9.81,9.8207,118.267079,253.15595,49.998381


## Key Takeaways
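* .head() shows the first few rows (five by default) for a quick look at the data.
* .shape returns the number of rows and columns as a tuple.
* .describe() provides summary statistics (count, mean, std, min, quartiles, max) for numeric columns.
* .isnull().sum() counts the missing values in each column.
* .dtypes shows the data type of each column.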
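
The calls listed above can be tried together on a small in-memory DataFrame. This is a minimal sketch: the `toy` frame and its column names are hypothetical and are not part of the projectile dataset.

```python
import pandas as pd

# Hypothetical toy data, used only to illustrate the inspection calls
toy = pd.DataFrame({
    "velocity": [10.0, 20.5, 30.0, None],
    "angle": [30, 45, 60, 75],
    "label": ["a", "b", "c", "d"],
})

print(toy.head(2))         # first two rows (head() defaults to five)
print(toy.shape)           # (rows, columns) as a tuple
print(toy.dtypes)          # data type of each column
print(toy.isnull().sum())  # missing values per column
print(toy.describe())      # summary statistics, numeric columns only by default
```

Running these checks right after loading a file usually shows whether any column was parsed with an unexpected type before further cleaning.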

# 2. Handling Missing Values and Outliers
Missing values and outliers can distort the results of a machine learning model. We need to either remove or impute missing values and handle outliers effectively.

## Handling Missing Values
* Remove rows or columns in which too many values are missing.
* Impute the remaining missing values using the mean, median, mode, or more advanced techniques.
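
One way to choose between the two options is to look at the share of missing values per column first. The sketch below assumes a 50% cutoff (`MAX_MISSING`), which is an illustrative choice and not a rule; the `toy` frame and its columns are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data for illustration only
toy = pd.DataFrame({
    "age": [25, np.nan, 30, 45, np.nan],
    "notes": [np.nan, np.nan, np.nan, "ok", np.nan],
    "dept": ["HR", "IT", None, "IT", "IT"],
})

# Fraction of missing values in each column
missing_share = toy.isnull().mean()

# Assumed cutoff: drop columns with more than half of their values missing
MAX_MISSING = 0.5
kept = toy.loc[:, missing_share <= MAX_MISSING].copy()

# Impute what remains: median for the numeric column, mode for the categorical one
kept["age"] = kept["age"].fillna(kept["age"].median())
kept["dept"] = kept["dept"].fillna(kept["dept"].mode()[0])

print(missing_share)
print(kept)
```

The median is used here instead of the mean because it is less sensitive to extreme values; either is a reasonable starting point for a numeric column.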

In [28]:
import numpy as np 
import pandas as pd

data = pd.read_csv("Data/employees.csv")  # forward slash avoids the invalid "\e" escape sequence
data


Unnamed: 0,Name,Age,Salary,Department,Experience (Years)
0,Alice,25.0,50000.0,HR,1.0
1,Bob,,60000.0,IT,3.0
2,Charlie,30.0,,Finance,5.0
3,David,45.0,80000.0,IT,10.0
4,Eve,,75000.0,HR,7.0
5,Frank,28.0,,,
6,Grace,35.0,90000.0,Finance,8.0
7,Hannah,,65000.0,HR,6.0
8,,40.0,,IT,
9,Jack,50.0,100000.0,,15.0


In [29]:
print(data.isnull().sum())


Name                  1
Age                   3
Salary                3
Department            2
Experience (Years)    2
dtype: int64


In [38]:
# Fill missing numeric values with the column mean.
# Note: astype(int) truncates, so an imputed mean such as 36.14 becomes 36.
data["Age"] = data["Age"].fillna(data["Age"].mean()).astype(int)

data["Salary"] = data["Salary"].fillna(data["Salary"].mean())
# data["Salary"] = data["Salary"].astype(int)

data["Experience (Years)"] = data["Experience (Years)"].fillna(data["Experience (Years)"].mean()).astype(int)
data


Unnamed: 0,Name,Age,Salary,Department,Experience (Years)
0,Alice,25,50000,HR,1
1,Bob,36,60000,IT,3
2,Charlie,30,74285,Finance,5
3,David,45,80000,IT,10
4,Eve,36,75000,HR,7
5,Frank,28,74285,HR,6
6,Grace,35,90000,Finance,8
7,Hannah,36,65000,HR,6
8,,40,74285,IT,6
9,Jack,50,100000,HR,15
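
Converting an imputed column with astype(int) drops the fractional part instead of rounding it. The sketch below reuses the Experience (Years) values from the employees table to show the difference; rounding before the conversion is one possible alternative, not necessarily the intended behaviour of the cell above.

```python
import numpy as np
import pandas as pd

# Experience (Years) values as they appear in the employees table (two missing)
exp = pd.Series([1, 3, 5, 10, 7, np.nan, 8, 6, np.nan, 15], name="Experience (Years)")

filled = exp.fillna(exp.mean())       # mean of the 8 known values is 6.875

truncated = filled.astype(int)        # 6.875 -> 6 (fraction dropped)
rounded = filled.round().astype(int)  # 6.875 -> 7 (rounded first)

print(truncated[5], rounded[5])
```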


In [42]:
# Fill NaN values in all numeric columns at once
numeric_cols = ["Age", "Salary", "Experience (Years)"]
data[numeric_cols] = data[numeric_cols].apply(lambda col: col.fillna(col.mean()))

# Convert specific columns to integers (astype(int) truncates any fractional part)
data = data.astype({"Age": int, "Experience (Years)": int})

# Display the cleaned DataFrame
data

Unnamed: 0,Name,Age,Salary,Department,Experience (Years)
0,Alice,25,50000,HR,1
1,Bob,36,60000,IT,3
2,Charlie,30,74285,Finance,5
3,David,45,80000,IT,10
4,Eve,36,75000,HR,7
5,Frank,28,74285,HR,6
6,Grace,35,90000,Finance,8
7,Hannah,36,65000,HR,6
8,Alice,40,74285,IT,6
9,Jack,50,100000,HR,15


In [39]:
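# Fill categorical NaNs with the mode (most frequent value)
data["Department"] = data["Department"].fillna(data["Department"].mode()[0])
# Every name is unique, so mode()[0] is just the alphabetically first name ("Alice")
data["Name"] = data["Name"].fillna(data["Name"].mode()[0])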
data

Unnamed: 0,Name,Age,Salary,Department,Experience (Years)
0,Alice,25,50000,HR,1
1,Bob,36,60000,IT,3
2,Charlie,30,74285,Finance,5
3,David,45,80000,IT,10
4,Eve,36,75000,HR,7
5,Frank,28,74285,HR,6
6,Grace,35,90000,Finance,8
7,Hannah,36,65000,HR,6
8,Alice,40,74285,IT,6
9,Jack,50,100000,HR,15
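
Mode imputation suits columns where values repeat, such as Department. For an identifier-like column such as Name, a placeholder may be a safer choice because it keeps the gap visible instead of copying another person's name. This is a minimal sketch on a hypothetical `toy` frame; the "Unknown" label is an assumption, not something the dataset defines.

```python
import pandas as pd

# Hypothetical toy frame mirroring the Name and Department columns
toy = pd.DataFrame({
    "Name": ["Alice", None, "Charlie", "David"],
    "Department": ["HR", "IT", None, "IT"],
})

# Identifier-like column: fill with an explicit placeholder
toy["Name"] = toy["Name"].fillna("Unknown")

# Category-like column: fill with the most frequent value ("IT" here)
toy["Department"] = toy["Department"].fillna(toy["Department"].mode()[0])

print(toy)
```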


In [40]:
# Drop any rows that still contain NaN values (none remain after the imputation steps, so the result is unchanged)
data1 = data.dropna()
data1

Unnamed: 0,Name,Age,Salary,Department,Experience (Years)
0,Alice,25,50000,HR,1
1,Bob,36,60000,IT,3
2,Charlie,30,74285,Finance,5
3,David,45,80000,IT,10
4,Eve,36,75000,HR,7
5,Frank,28,74285,HR,6
6,Grace,35,90000,Finance,8
7,Hannah,36,65000,HR,6
8,Alice,40,74285,IT,6
9,Jack,50,100000,HR,15
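
Called without arguments, dropna() removes every row that has at least one missing value, which can discard a lot of data. The sketch below shows two narrower options using the `subset` and `thresh` parameters of `DataFrame.dropna`; the `toy` frame is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with gaps in different columns
toy = pd.DataFrame({
    "Name": ["Alice", "Bob", None, "Dan"],
    "Age": [25.0, np.nan, 40.0, np.nan],
    "Salary": [50000.0, 60000.0, np.nan, np.nan],
})

fully_populated = toy.dropna()            # keep rows with no missing values
need_name = toy.dropna(subset=["Name"])   # only require Name to be present
at_least_two = toy.dropna(thresh=2)       # keep rows with 2 or more non-null values

print(len(fully_populated), len(need_name), len(at_least_two))
```

Comparing the row counts of the three results gives a quick sense of how much data each rule would throw away.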
