# Data Preprocessing & Cleaning
* Importing and exploring datasets.
* Handling missing values and outliers.
* Data normalization and scaling.
* Feature selection and engineering.
* Hands-on: Cleaning a real-world dataset.


# 1. Importing and Exploring Datasets
Before working with data, we need to import it and understand its structure. This involves checking data types, missing values, basic statistics, and data distribution.

In [None]:
import pandas as pd

df = pd.read_csv("./Data/projectile_motion_data.csv") # importing the data
df

,Initial Velocity (m/s),Launch Angle (degrees),Gravity (m/s²),Time of Flight (s),Maximum Height (m),Range (m),Final Velocity (m/s)
0,35.704696,29.621646,9.81,3.597915,15.873795,111.673470,35.704696
1,41.256605,16.854316,9.81,2.438717,7.292927,96.291353,41.256605
2,35.214240,45.086542,9.81,5.084161,31.696959,126.405404,35.214240
3,24.008337,74.626229,9.81,4.719520,27.313328,30.039574,24.008337
4,10.615840,53.304853,9.81,1.735384,3.692923,11.008535,10.615840
...,...,...,...,...,...,...,...
995,20.817931,37.937856,9.81,2.609378,8.349355,42.842447,20.817931
996,18.229606,47.936391,9.81,2.759162,9.335407,33.697695,18.229606
997,46.602734,64.246195,9.81,8.557320,89.795505,173.278245,46.602734
998,20.798318,77.749996,9.81,4.143683,21.054842,18.285823,20.798318


In [11]:
print(f"Rows: {df.shape[0]}, Columns: {df.shape[1]}")

Rows: 1000, Columns: 7


In [12]:
print(df.isnull().sum()) # Checking for missing values

Initial Velocity (m/s)    0
Launch Angle (degrees)    0
Gravity (m/s²)            0
Time of Flight (s)        0
Maximum Height (m)        0
Range (m)                 0
Final Velocity (m/s)      0
dtype: int64


In [8]:
df.describe() # Descriptive statistics

,Initial Velocity (m/s),Launch Angle (degrees),Gravity (m/s²),Time of Flight (s),Maximum Height (m),Range (m),Final Velocity (m/s)
count,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0
mean,27.65994,43.913658,9.81,3.722127,23.859054,74.425397,27.65994
std,13.009542,19.756287,6.398084e-14,2.368185,26.813351,63.995819,13.009542
min,5.040648,10.005071,9.81,0.261268,0.083705,1.243978,5.040648
25%,16.737346,26.898061,9.81,1.726966,3.65719,20.056532,16.737346
50%,27.209961,43.226227,9.81,3.131978,12.028655,55.244295,27.209961
75%,39.139619,60.571917,9.81,5.351349,35.116042,117.217496,39.139619
max,49.998381,79.99518,9.81,9.8207,118.267079,253.15595,49.998381
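
The summary above shows a standard deviation of essentially zero for Gravity, and identical statistics for Initial Velocity and Final Velocity. The sketch below shows one way to check for constant or duplicated columns. It uses a small frame built from the first three rows displayed earlier; whether the same pattern holds for the full file is an assumption that should be verified on `df` itself. The names `rows`, `constant_cols`, and `same_velocity` are illustrative.

```python
import pandas as pd

# Small frame copied from the first three rows shown above (not the full file)
rows = pd.DataFrame({
    "Initial Velocity (m/s)": [35.704696, 41.256605, 35.214240],
    "Gravity (m/s²)": [9.81, 9.81, 9.81],
    "Final Velocity (m/s)": [35.704696, 41.256605, 35.214240],
})

# Columns with a single distinct value carry no information for a model
constant_cols = [col for col in rows.columns if rows[col].nunique() == 1]

# Element-wise comparison to see whether two columns are exact duplicates
same_velocity = bool(
    (rows["Initial Velocity (m/s)"] == rows["Final Velocity (m/s)"]).all()
)

print("Constant columns:", constant_cols)
print("Initial and final velocity identical:", same_velocity)
```

Columns flagged this way are candidates for removal during feature selection, since they add no new information.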


## Key Takeaways
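* .head() shows the first rows (five by default) for a quick look at the data.
* .shape returns the number of rows and columns as a tuple.
* .describe() provides summary statistics (count, mean, std, min, quartiles, max) for numeric columns.
* .isnull().sum() counts the missing values in each column.
* .dtypes shows the data type of each column.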
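
Since .head() and .dtypes are not run in the cells above, here is a minimal sketch of all of these calls together. It assumes a small frame named `sample`, built from the first four rows displayed earlier, rather than the full CSV file.

```python
import pandas as pd

# Small frame copied from the first four rows shown above (not the full file)
sample = pd.DataFrame({
    "Initial Velocity (m/s)": [35.704696, 41.256605, 35.214240, 24.008337],
    "Launch Angle (degrees)": [29.621646, 16.854316, 45.086542, 74.626229],
})

print(sample.head(2))          # first two rows only
print(sample.shape)            # (number of rows, number of columns)
print(sample.dtypes)           # data type of each column
print(sample.describe())       # summary statistics for numeric columns
print(sample.isnull().sum())   # missing values per column
```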

# 2. Handling Missing Values and Outliers
Missing values and outliers can distort summary statistics and the results of a machine learning model. Missing values are either removed or imputed; outliers are typically removed, capped, or transformed.

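## Handling Missing Values
* Remove rows with missing values when only a small fraction of rows is affected; consider dropping a column when most of its values are missing.
* Impute missing values using the mean, median, mode, or more advanced techniques.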
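
Because the projectile dataset contains no missing values, the effect of imputation is easier to see on a small synthetic frame. The sketch below is only an illustration: the frame `toy` and its columns `speed` and `category` are made up and are not part of the course data.

```python
import numpy as np
import pandas as pd

# Synthetic example with deliberately missing entries (not the course data)
toy = pd.DataFrame({
    "speed": [10.0, np.nan, 30.0, 50.0],
    "category": ["a", "b", None, "b"],
})

print(toy.isnull().sum())  # one missing value in each column

# Numerical column: fill with the median, which is less sensitive
# to extreme values than the mean
toy["speed"] = toy["speed"].fillna(toy["speed"].median())

# Categorical column: fill with the most frequent value (the mode)
toy["category"] = toy["category"].fillna(toy["category"].mode()[0])

print(toy)
```

Assigning the result back to the column, instead of using inplace=True on a column selection, keeps the code compatible with newer pandas versions.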

In [18]:
# Remove rows with missing values
df_cleaned = df.dropna()
df_cleaned

,Initial Velocity (m/s),Launch Angle (degrees),Gravity (m/s²),Time of Flight (s),Maximum Height (m),Range (m),Final Velocity (m/s)
0,35.704696,29.621646,9.81,3.597915,15.873795,111.673470,35.704696
1,41.256605,16.854316,9.81,2.438717,7.292927,96.291353,41.256605
2,35.214240,45.086542,9.81,5.084161,31.696959,126.405404,35.214240
3,24.008337,74.626229,9.81,4.719520,27.313328,30.039574,24.008337
4,10.615840,53.304853,9.81,1.735384,3.692923,11.008535,10.615840
...,...,...,...,...,...,...,...
995,20.817931,37.937856,9.81,2.609378,8.349355,42.842447,20.817931
996,18.229606,47.936391,9.81,2.759162,9.335407,33.697695,18.229606
997,46.602734,64.246195,9.81,8.557320,89.795505,173.278245,46.602734
998,20.798318,77.749996,9.81,4.143683,21.054842,18.285823,20.798318


In [19]:
# Fill missing values with mean (for numerical columns)
# Assign the result back to the column: calling fillna(..., inplace=True) on df[col]
# raises a FutureWarning and will stop working in pandas 3.0.
df["Initial Velocity (m/s)"] = df["Initial Velocity (m/s)"].fillna(df["Initial Velocity (m/s)"].mean())


In [None]:
# Fill missing values with mode (for categorical columns)
# "Embarked" is only an example column name; this dataset has no categorical columns.
# df["Embarked"] = df["Embarked"].fillna(df["Embarked"].mode()[0])

In [20]:
# Check again for missing values
print(df.isnull().sum())


Initial Velocity (m/s)    0
Launch Angle (degrees)    0
Gravity (m/s²)            0
Time of Flight (s)        0
Maximum Height (m)        0
Range (m)                 0
Final Velocity (m/s)      0
dtype: int64
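
The cells above cover missing values only. For the outlier half of this section, one common approach is the interquartile range (IQR) rule, sketched below. This is an illustrative choice and not necessarily the method used in the course. The series `values` is synthetic, and the factor 1.5 is the conventional default, not a value taken from the notebook.

```python
import pandas as pd

# Synthetic values with one obvious outlier (not the course data)
values = pd.Series([10.0, 12.0, 11.0, 13.0, 12.0, 95.0], name="range_m")

# IQR rule: flag points further than 1.5 * IQR outside the quartiles
q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

is_outlier = (values < lower) | (values > upper)
print(values[is_outlier])

# Option 1: drop the flagged points
filtered = values[~is_outlier]

# Option 2: keep every row but cap extreme values at the bounds
clipped = values.clip(lower=lower, upper=upper)
print(clipped)
```

Dropping is simplest when outliers are measurement errors; clipping keeps the row count unchanged, which can matter when other columns in the same row are still valid.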
