# *Task 1: Handling Missing Data – Titanic Dataset*

**Introduction:** This task identifies and resolves missing values in the Titanic dataset to prepare it for machine learning.

**Techniques:** Median imputation for numerical data, Mode imputation for categorical data, and Column dropping for features with excessive missingness.

**Reason:** Median is used for 'Age' to avoid outlier influence; Mode is used for 'Embarked' as it is categorical; 'Cabin' is dropped because over 70% of its data is missing.

In [16]:
import pandas as pd 

In [17]:
df = pd.read_csv('/kaggle/input/competitions/titanic/train.csv')

# Missing Values BEFORE Handling
print("Missing Values Before Preprocessing:\n")
missing_before = df.isnull().sum()
print(missing_before)

# 1. Median Imputation for 'Age'
df['Age'] = df['Age'].fillna(df['Age'].median())

# 2. Mode Imputation for 'Embarked' 
df['Embarked'] = df['Embarked'].fillna(df['Embarked'].mode()[0])

# 3. Dropping 'Cabin' column (too many missing values)
df.drop(columns=['Cabin'], inplace=True)

# 4. Drop any rows that still contain nulls
df.dropna(inplace=True)

# Missing Values AFTER Handling
print("\nMissing Values After Preprocessing:\n")
missing_after = df.isnull().sum()
print(missing_after)

Missing Values Before Preprocessing:

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

Missing Values After Preprocessing:

PassengerId    0
Survived       0
Pclass         0
Name           0
Sex            0
Age            0
SibSp          0
Parch          0
Ticket         0
Fare           0
Embarked       0
dtype: int64
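
The decision to drop 'Cabin' rests on its share of missing values (687 rows in the output above). A minimal sketch of how that share could be computed and thresholded is shown here; the helper name `columns_to_drop`, the 0.7 threshold, and the toy frame are illustrative assumptions, not part of the notebook.

```python
import numpy as np
import pandas as pd

def columns_to_drop(df: pd.DataFrame, threshold: float = 0.7) -> list:
    """Return columns whose fraction of missing values exceeds `threshold`."""
    missing_fraction = df.isnull().mean()  # mean of booleans = fraction missing
    return missing_fraction[missing_fraction > threshold].index.tolist()

# Toy frame (hypothetical values) that mimics the pattern seen above
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Cabin": [np.nan, np.nan, np.nan, "C85"],
    "Embarked": ["S", "C", None, "S"],
})

to_drop = columns_to_drop(toy)          # only 'Cabin' is 75% missing
toy_clean = toy.drop(columns=to_drop)
toy_clean["Age"] = toy_clean["Age"].fillna(toy_clean["Age"].median())
toy_clean["Embarked"] = toy_clean["Embarked"].fillna(toy_clean["Embarked"].mode()[0])
```

Making the threshold explicit keeps the "too many missing values" rule reproducible if the data changes.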


# *Task 2: Feature Encoding – Car Evaluation Dataset*

**Introduction:** This task converts categorical text data into numerical format using two different encoding techniques.

**Techniques:** One-Hot Encoding and Label Encoding.

**Reason:** Label Encoding is used for 'target_class' so that the target stays in a single column. One-Hot Encoding is used for the features because integer codes would impose an arbitrary order on categories such as 'vhigh' and 'med' (LabelEncoder numbers classes alphabetically, not by meaning).

In [18]:
from sklearn.preprocessing import LabelEncoder

# 1. Load data and manually assign the correct names
# 'target_class' holds the label (e.g. 'unacc'); the other six columns are features
col_names = ['buying', 'maint', 'doors', 'persons', 'lug_boot', 'safety', 'target_class']
path = '/kaggle/input/datasets/ayeshaaniqa/car-evaluation/car.data' 
df_car = pd.read_csv(path, names=col_names)

# 2. Label Encoding (Applied to the 'target_class' column)
le = LabelEncoder()
df_car['target_class_encoded'] = le.fit_transform(df_car['target_class'])

# 3. One-Hot Encoding (Applied to all feature columns)
df_encoded = pd.get_dummies(df_car, columns=['buying', 'maint', 'doors', 'persons', 'lug_boot', 'safety'])

# Displaying result
df_encoded.head()

Unnamed: 0,target_class,target_class_encoded,buying_high,buying_low,buying_med,buying_vhigh,maint_high,maint_low,maint_med,maint_vhigh,...,doors_5more,persons_2,persons_4,persons_more,lug_boot_big,lug_boot_med,lug_boot_small,safety_high,safety_low,safety_med
0,unacc,2,False,False,False,True,False,False,False,True,...,False,True,False,False,False,False,True,False,True,False
1,unacc,2,False,False,False,True,False,False,False,True,...,False,True,False,False,False,False,True,False,False,True
2,unacc,2,False,False,False,True,False,False,False,True,...,False,True,False,False,False,False,True,True,False,False
3,unacc,2,False,False,False,True,False,False,False,True,...,False,True,False,False,False,True,False,False,True,False
4,unacc,2,False,False,False,True,False,False,False,True,...,False,True,False,False,False,True,False,False,False,True
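
To make the ordering concern concrete, the sketch below compares the integers LabelEncoder assigns with an explicit ordinal mapping. The `order` dictionary is a hypothetical mapping written for this illustration; it is not used in the notebook.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

levels = pd.Series(["vhigh", "high", "med", "low"])

# LabelEncoder sorts class names alphabetically before assigning integers,
# so classes_ becomes ['high', 'low', 'med', 'vhigh']
le = LabelEncoder()
alphabetical_codes = le.fit_transform(levels)

# Hypothetical explicit mapping that respects the natural price order
order = {"low": 0, "med": 1, "high": 2, "vhigh": 3}
ordinal_codes = levels.map(order)
```

With the alphabetical codes, 'high' (0) ranks below 'low' (1), which is the kind of misleading order that one-hot encoding sidesteps.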


# *Task 3: Feature Scaling – Wine Quality Dataset*

**Introduction:** This task involves adjusting the scale of numerical features in the Wine Quality dataset to ensure that no single feature dominates the model due to its magnitude.

**Techniques:** Normalization (Min-Max Scaling) and Standardization (Z-score Scaling).

**Reason:** Normalization is used to bound features between 0 and 1, which is useful for algorithms like KNN. Standardization is used to center the data around a mean of 0 with a standard deviation of 1, which is preferred for linear models and PCA.
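
The two rescalings described above reduce to short formulas. The sketch below writes them out with NumPy on a toy column and checks them against the scikit-learn scalers; the toy values are an assumption for illustration.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

x = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])  # toy single-feature column

# Min-max: (x - min) / (max - min), which maps the column onto [0, 1]
manual_minmax = (x - x.min(axis=0)) / (x.max(axis=0) - x.min(axis=0))

# Z-score: (x - mean) / std, using the population standard deviation (ddof=0)
manual_zscore = (x - x.mean(axis=0)) / x.std(axis=0)

sk_minmax = MinMaxScaler().fit_transform(x)
sk_zscore = StandardScaler().fit_transform(x)
```

In a modelling workflow the scaler would normally be fitted on the training split only and then applied to the test split, so that test statistics do not leak into training.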

In [19]:
from sklearn.preprocessing import MinMaxScaler, StandardScaler

path = '/kaggle/input/datasets/ayeshaaniqa/red-wine-quality/winequality-red.csv' 
df_wine = pd.read_csv(path, sep=';')

# Select the numerical features to scale (the target 'quality' is excluded)
features = df_wine.drop(columns=['quality'])

# 1. Normalization (Min-Max Scaling)
scaler_minmax = MinMaxScaler()
df_normalized = pd.DataFrame(scaler_minmax.fit_transform(features), columns=features.columns)
print("Normalized: ", df_normalized.head())

# 2. Standardization (Standard Scaling)
scaler_standard = StandardScaler()
df_standardized = pd.DataFrame(scaler_standard.fit_transform(features), columns=features.columns)
print("Standardized: ", df_standardized.head())


Normalized:     fixed acidity  volatile acidity  citric acid  residual sugar  chlorides  \
0       0.247788          0.397260         0.00        0.068493   0.106845   
1       0.283186          0.520548         0.00        0.116438   0.143573   
2       0.283186          0.438356         0.04        0.095890   0.133556   
3       0.584071          0.109589         0.56        0.068493   0.105175   
4       0.247788          0.397260         0.00        0.068493   0.106845   

   free sulfur dioxide  total sulfur dioxide   density        pH  sulphates  \
0             0.140845              0.098940  0.567548  0.606299   0.137725   
1             0.338028              0.215548  0.494126  0.362205   0.209581   
2             0.197183              0.169611  0.508811  0.409449   0.191617   
3             0.225352              0.190813  0.582232  0.330709   0.149701   
4             0.140845              0.098940  0.567548  0.606299   0.137725   

    alcohol  
0  0.153846  
1  0.215385  
2

# *Task 4: Handling Outliers – Boston Housing Dataset*

**Introduction:** This task involves identifying and treating extreme values (outliers) in the Boston Housing dataset to prevent them from negatively impacting the performance of regression models.

**Technique:** Z-score method to identify outliers and Winsorization (capping) to treat them.

**Reason:** The Z-score method is suitable for identifying data points that are significantly far from the mean. Winsorization is chosen over deletion to preserve data points while reducing the influence of extreme values on the model.
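
As a small illustration of the two steps, the sketch below flags an extreme value with the Z-score and then caps the column at its 5th and 95th percentiles. The toy values are hypothetical and chosen so that exactly one reading is extreme.

```python
import numpy as np

# Toy column (hypothetical values): twenty ordinary readings and one extreme one
values = np.array([10.0] * 10 + [12.0] * 10 + [100.0])

# Z-score with the population standard deviation (ddof=0)
z = (values - values.mean()) / values.std()
outlier_mask = np.abs(z) > 3

# Winsorization: cap at the 5th and 95th percentiles instead of deleting rows
lower, upper = np.quantile(values, [0.05, 0.95])
capped = np.clip(values, lower, upper)
```

The row count is unchanged after capping, which is the main reason to prefer Winsorization over deletion on small datasets.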

In [20]:
from scipy import stats
import numpy as np
# Data loading
path = '/kaggle/input/competitions/boston-dataset/boston_data.csv' 
df_boston = pd.read_csv(path)

# Select numerical features for outlier detection
numerical_cols = df_boston.select_dtypes(include=[np.number]).columns

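# 1. Identify outliers using the Z-score (|z| > 3 in any numerical column)
z_scores = np.abs(stats.zscore(df_boston[numerical_cols]))
outlier_mask = (z_scores > 3).any(axis=1)  # True for rows with at least one outlier

# 2. Treat outliers using Winsorization
# Note: every numerical column is capped at its 5th/95th percentile,
# independently of the Z-score mask computed above.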
df_treated = df_boston.copy()
for col in numerical_cols:
    lower_limit = df_treated[col].quantile(0.05)
    upper_limit = df_treated[col].quantile(0.95)
    df_treated[col] = np.where(df_treated[col] < lower_limit, lower_limit,
                               np.where(df_treated[col] > upper_limit, upper_limit, df_treated[col]))

In [21]:
print("\nDescription of features before treatment:")
print(df_boston[numerical_cols].describe())
print("\nDescription of features after Winsorization:")
print(df_treated[numerical_cols].describe())


Description of features before treatment:
             crim          zn       indus        chas         nox         rm  \
count  404.000000  404.000000  404.000000  404.000000  404.000000  404.00000   
mean     3.730912   10.509901   11.189901    0.069307    0.556710    6.30145   
std      8.943922   22.053733    6.814909    0.254290    0.117321    0.67583   
min      0.006320    0.000000    0.460000    0.000000    0.392000    3.56100   
25%      0.082382    0.000000    5.190000    0.000000    0.453000    5.90275   
50%      0.253715    0.000000    9.795000    0.000000    0.538000    6.23050   
75%      4.053158   12.500000   18.100000    0.000000    0.631000    6.62925   
max     88.976200   95.000000   27.740000    1.000000    0.871000    8.78000   

              age         dis         rad         tax     ptratio       black  \
count  404.000000  404.000000  404.000000  404.000000  404.000000  404.000000   
mean    68.601733    3.799666    9.836634  411.688119   18.444554  355.068

# *Task 5: Data Imputation (Advanced) – Retail Sales Dataset*

**Introduction:** This task involves filling in missing numerical values in the Retail Sales dataset using advanced statistical imputation techniques rather than simple mean/median methods.

**Technique:** K-Nearest Neighbors (KNN) Imputation and MICE (Multivariate Imputation by Chained Equations).

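**Reason:** Advanced techniques such as KNN and MICE model relationships between features, which can yield more accurate imputations and less bias than simple mean/median imputation. Note that the output below reports no missing values in this dataset, so both imputers leave the numerical columns unchanged here.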

In [22]:
from sklearn.impute import KNNImputer
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer

# Data loading
path = '/kaggle/input/datasets/mohammadtalib786/retail-sales-dataset/retail_sales_dataset.csv' 
df_retail = pd.read_csv(path)

# Display missing values before imputation
print("Missing values before imputation:")
print(df_retail.isnull().sum())

# Select numerical columns for imputation
numerical_cols = df_retail.select_dtypes(include=['float64', 'int64']).columns
df_numeric = df_retail[numerical_cols].copy()

# 1. KNN Imputation
knn_imputer = KNNImputer(n_neighbors=5)
df_knn = pd.DataFrame(knn_imputer.fit_transform(df_numeric), columns=numerical_cols)

# 2. MICE Imputation
mice_imputer = IterativeImputer(random_state=0)
df_mice = pd.DataFrame(mice_imputer.fit_transform(df_numeric), columns=numerical_cols)

# Display missing values after imputation to confirm
print("\nMissing values after KNN imputation:")
print(df_knn.isnull().sum())
print("\nMissing values after MICE imputation:")
print(df_mice.isnull().sum())

Missing values before imputation:
Transaction ID      0
Date                0
Customer ID         0
Gender              0
Age                 0
Product Category    0
Quantity            0
Price per Unit      0
Total Amount        0
dtype: int64

Missing values after KNN imputation:
Transaction ID    0
Age               0
Quantity          0
Price per Unit    0
Total Amount      0
dtype: int64

Missing values after MICE imputation:
Transaction ID    0
Age               0
Quantity          0
Price per Unit    0
Total Amount      0
dtype: int64
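
Because the retail file has no gaps, the output above cannot show what the imputers actually fill in. The sketch below runs both on a toy array with one deliberately missing value; the array and the choice of `n_neighbors=2` are assumptions made for illustration.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer, KNNImputer

# Toy data (hypothetical): the second column is twice the first, one value missing
X_toy = np.array([
    [1.0, 2.0],
    [2.0, 4.0],
    [3.0, 6.0],
    [4.0, np.nan],
    [5.0, 10.0],
])

# KNN: average the second column over the 2 rows closest in the first column
knn_filled = KNNImputer(n_neighbors=2).fit_transform(X_toy)

# MICE-style: regress the second column on the first and predict the gap
mice_filled = IterativeImputer(random_state=0).fit_transform(X_toy)
```

Here the nearest rows are (3, 6) and (5, 10), so KNN fills in their average, while the iterative imputer predicts from the fitted linear relationship.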


# *Task 6: Feature Engineering – Heart Disease Dataset*

**Introduction:** This task involves creating new, meaningful features from existing data to help a machine learning model capture patterns more effectively.

**Technique:** Creating derived categorical features: "Age Group" and "Cholesterol Level".

**Reason:** Raw continuous values like age and cholesterol can be grouped into clinically relevant categories, which helps simplify complex relationships for certain types of models.

In [23]:
path = '/kaggle/input/datasets/johnsmith88/heart-disease-dataset/heart.csv' 
df_heart = pd.read_csv(path)

# Display original data sample
print("Original features (Age and Chol):")
print(df_heart[['age', 'chol']].head())

# 1. Derived Feature: Age Groups
def get_age_group(age):
    if age < 40: return 'Young'
    elif age <= 60: return 'Middle-Aged'
    else: return 'Senior'

df_heart['age_group'] = df_heart['age'].apply(get_age_group)

# 2. Derived Feature: Cholesterol Categories
def get_chol_category(chol):
    if chol < 200: return 'Normal'
    elif chol < 240: return 'Borderline'
    else: return 'High'

df_heart['chol_category'] = df_heart['chol'].apply(get_chol_category)

# Display Results
print("\nNew Derived Features:")
print(df_heart[['age', 'age_group', 'chol', 'chol_category']].head())

print("\nDistribution of New Features:")
print(df_heart['age_group'].value_counts())
print(df_heart['chol_category'].value_counts())

Original features (Age and Chol):
   age  chol
0   52   212
1   53   203
2   70   174
3   61   203
4   62   294

New Derived Features:
   age    age_group  chol chol_category
0   52  Middle-Aged   212    Borderline
1   53  Middle-Aged   203    Borderline
2   70       Senior   174        Normal
3   61       Senior   203    Borderline
4   62       Senior   294          High

Distribution of New Features:
age_group
Middle-Aged    696
Senior         272
Young           57
Name: count, dtype: int64
chol_category
High          517
Borderline    339
Normal        169
Name: count, dtype: int64
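
The same binning can be expressed without row-wise `apply`. The sketch below is one possible vectorized equivalent using `np.select` and `pd.cut`, run on the five rows printed above; it is an alternative formulation, not the code that produced the output.

```python
import numpy as np
import pandas as pd

# First five rows from the output above
df = pd.DataFrame({"age": [52, 53, 70, 61, 62], "chol": [212, 203, 174, 203, 294]})

# Same thresholds as get_age_group: below 40, 40 to 60 inclusive, above 60
df["age_group"] = np.select(
    [df["age"] < 40, df["age"] <= 60],
    ["Young", "Middle-Aged"],
    default="Senior",
)

# Same thresholds as get_chol_category; right=False makes each bin [low, high)
df["chol_category"] = pd.cut(
    df["chol"],
    bins=[-np.inf, 200, 240, np.inf],
    labels=["Normal", "Borderline", "High"],
    right=False,
)
```

`pd.cut` returns an ordered categorical, which preserves the Normal < Borderline < High order for later encoding.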


# *Task 7: Variable Transformation – Bike Sharing Dataset*

**Introduction:** This task involves applying mathematical transformations to numerical features in the Bike Sharing dataset to handle skewed data and stabilize variance.

**Techniques:** Log, Box-Cox, and Square Root Transformations.

**Reason:** Many statistical models work best when features are roughly symmetric (close to Gaussian). Log, Box-Cox, and square root transformations can "un-skew" right-skewed data, making the relationship between variables more linear. They only help when the data is actually skewed: in the results below, 'cnt' starts out nearly symmetric (skewness -0.0496), and the log transform makes it strongly left-skewed (-1.9154).

In [24]:
path = '/kaggle/input/datasets/yasserh/bike-sharing-dataset/day.csv' 
df_bike = pd.read_csv(path)

# 1. Original Skewness
original_skew = df_bike['cnt'].skew()
print(f"Original Skewness of 'cnt': {original_skew:.4f}")

# 2. Log Transformation (using log1p to handle 0s)
df_bike['cnt_log'] = np.log1p(df_bike['cnt'])

# 3. Box-Cox Transformation (requires positive values)
df_bike['cnt_boxcox'], _ = stats.boxcox(df_bike['cnt'])

# 4. Square Root Transformation
df_bike['cnt_sqrt'] = np.sqrt(df_bike['cnt'])

# Display Results
print("\n--- Skewness Comparison ---")
print(f"Log Transform: {df_bike['cnt_log'].skew():.4f}")
print(f"Box-Cox Transform: {df_bike['cnt_boxcox'].skew():.4f}")
print(f"Square Root Transform: {df_bike['cnt_sqrt'].skew():.4f}")

print("\n--- Transformed Data (First 5 Rows) ---")
print(df_bike[['cnt', 'cnt_log', 'cnt_boxcox', 'cnt_sqrt']].head())

Original Skewness of 'cnt': -0.0496

--- Skewness Comparison ---
Log Transform: -1.9154
Box-Cox Transform: -0.1544
Square Root Transform: -0.5794

--- Transformed Data (First 5 Rows) ---
    cnt   cnt_log  cnt_boxcox   cnt_sqrt
0   985  6.893656  505.996549  31.384710
1   801  6.687109  421.088686  28.301943
2  1349  7.207860  668.973663  36.728735
3  1562  7.354362  761.935717  39.522146
4  1600  7.378384  778.363243  40.000000
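
Since a transform is only worthwhile when it reduces skewness, it can help to compare candidates before picking one. The sketch below does this on synthetic right-skewed data; the helper name `skew_report` and the lognormal sample are assumptions for illustration, not the bike data.

```python
import numpy as np
import pandas as pd
from scipy import stats

def skew_report(values: pd.Series) -> dict:
    """Skewness of a strictly positive series before and after each transform."""
    boxcox_values, _ = stats.boxcox(values.to_numpy())
    return {
        "raw": values.skew(),
        "log1p": np.log1p(values).skew(),
        "sqrt": np.sqrt(values).skew(),
        "boxcox": pd.Series(boxcox_values).skew(),
    }

# Synthetic right-skewed data (an assumption for illustration)
rng = np.random.default_rng(0)
right_skewed = pd.Series(rng.lognormal(mean=3.0, sigma=1.0, size=1000))

report = skew_report(right_skewed)
# Pick the candidate whose skewness is closest to zero
best = min(report, key=lambda name: abs(report[name]))
```

On data like this the transforms pull skewness toward zero, whereas for an already symmetric column the untransformed values would be selected.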


# *Task 8: Feature Selection – Diabetes Dataset*

**Introduction:** This task focuses on identifying the most relevant features in the Pima Indians Diabetes dataset to reduce model complexity and improve performance.

**Technique:** Correlation Analysis and Chi-Square Test.

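**Reason:** Correlation analysis measures linear relationships between numerical features and the target. The Chi-Square test evaluates whether a feature is independent of the target; it is designed for categorical or count features, so the scores it produces for the continuous measurements here depend on each feature's scale and should be read with caution. Together, the two rankings help identify redundant or irrelevant features to drop.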

In [25]:
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.feature_selection import SelectKBest, chi2

path = '/kaggle/input/datasets/akshaydattatraykhare/diabetes-dataset/diabetes.csv' 
df_diabetes = pd.read_csv(path)

# 1. Correlation Analysis
correlation_matrix = df_diabetes.corr()
print("Correlation with Target (Outcome):")
print(correlation_matrix['Outcome'].sort_values(ascending=False))

# 2. Chi-Square Test for Feature Selection
X = df_diabetes.drop(columns=['Outcome'])
y = df_diabetes['Outcome']

# Apply SelectKBest to extract top 4 best features
bestfeatures = SelectKBest(score_func=chi2, k=4)
fit = bestfeatures.fit(X, y)

# Create a dataframe to view scores
featureScores = pd.DataFrame({'Feature': X.columns, 'Score': fit.scores_})

print("\n--- Top 4 Features via Chi-Square Test ---")
print(featureScores.nlargest(4, 'Score'))

Correlation with Target (Outcome):
Outcome                     1.000000
Glucose                     0.466581
BMI                         0.292695
Age                         0.238356
Pregnancies                 0.221898
DiabetesPedigreeFunction    0.173844
Insulin                     0.130548
SkinThickness               0.074752
BloodPressure               0.065068
Name: Outcome, dtype: float64

--- Top 4 Features via Chi-Square Test ---
   Feature        Score
4  Insulin  2175.565273
1  Glucose  1411.887041
7      Age   181.303689
5      BMI   127.669343
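
The two rankings above disagree (Insulin tops the Chi-Square list but is weakly correlated with the outcome). One likely contributor is that scikit-learn's `chi2` score grows with a feature's magnitude. The sketch below illustrates this on a synthetic feature and contrasts it with the scale-invariant ANOVA F-score from `f_classif`, which is swapped in here purely for comparison.

```python
import numpy as np
from sklearn.feature_selection import chi2, f_classif

rng = np.random.default_rng(0)
y_toy = rng.integers(0, 2, size=200)
# One non-negative synthetic feature whose mean shifts with the class
feature = rng.uniform(1.0, 2.0, size=200) + 0.5 * y_toy
X_toy = feature.reshape(-1, 1)

chi2_small, _ = chi2(X_toy, y_toy)
chi2_large, _ = chi2(X_toy * 10, y_toy)      # same information, different unit
f_small, _ = f_classif(X_toy, y_toy)
f_large, _ = f_classif(X_toy * 10, y_toy)
```

Multiplying the feature by 10 multiplies its Chi-Square score by 10 while the F-score is unchanged, so features measured in large units can dominate a Chi-Square ranking.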


# *Task 9: Handling Imbalanced Data – Credit Card Fraud Detection*

**Introduction:** This task addresses the extreme class imbalance in fraud detection datasets, where legitimate transactions far outnumber fraudulent ones.

**Technique:** Random Undersampling and SMOTE (Synthetic Minority Over-sampling Technique).

**Reason:** Models trained on imbalanced data often ignore the minority class (fraud). Undersampling balances the classes by removing majority samples, while SMOTE creates synthetic fraudulent samples to provide the model with more patterns to learn from.

In [26]:
from imblearn.over_sampling import SMOTE
from sklearn.utils import resample

path = '/kaggle/input/datasets/organizations/mlg-ulb/creditcardfraud/creditcard.csv' 
df_fraud = pd.read_csv(path)
df_fraud.head()

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
0,0.0,-1.359807,-0.072781,2.536347,1.378155,-0.338321,0.462388,0.239599,0.098698,0.363787,...,-0.018307,0.277838,-0.110474,0.066928,0.128539,-0.189115,0.133558,-0.021053,149.62,0
1,0.0,1.191857,0.266151,0.16648,0.448154,0.060018,-0.082361,-0.078803,0.085102,-0.255425,...,-0.225775,-0.638672,0.101288,-0.339846,0.16717,0.125895,-0.008983,0.014724,2.69,0
2,1.0,-1.358354,-1.340163,1.773209,0.37978,-0.503198,1.800499,0.791461,0.247676,-1.514654,...,0.247998,0.771679,0.909412,-0.689281,-0.327642,-0.139097,-0.055353,-0.059752,378.66,0
3,1.0,-0.966272,-0.185226,1.792993,-0.863291,-0.010309,1.247203,0.237609,0.377436,-1.387024,...,-0.1083,0.005274,-0.190321,-1.175575,0.647376,-0.221929,0.062723,0.061458,123.5,0
4,2.0,-1.158233,0.877737,1.548718,0.403034,-0.407193,0.095921,0.592941,-0.270533,0.817739,...,-0.009431,0.798278,-0.137458,0.141267,-0.20601,0.502292,0.219422,0.215153,69.99,0


In [27]:

# Display original class distribution
print("Original Class Distribution:")
print(df_fraud['Class'].value_counts())

# 1. Random Undersampling: separate majority and minority classes
df_majority = df_fraud[df_fraud.Class == 0]
df_minority = df_fraud[df_fraud.Class == 1]

# Downsample majority class to match minority class size
df_majority_downsampled = resample(df_majority, 
                                 replace=False,    
                                 n_samples=len(df_minority),
                                 random_state=42)

df_undersampled = pd.concat([df_majority_downsampled, df_minority])

print("\nClass Distribution after Undersampling:")
print(df_undersampled['Class'].value_counts())

# 2. SMOTE (Oversampling)
X = df_fraud.drop('Class', axis=1)
y = df_fraud['Class']

smote = SMOTE(random_state=42)
X_smote, y_smote = smote.fit_resample(X, y)

print("\nClass Distribution after SMOTE:")
print(pd.Series(y_smote).value_counts())

Original Class Distribution:
Class
0    284315
1       492
Name: count, dtype: int64

Class Distribution after Undersampling:
Class
0    492
1    492
Name: count, dtype: int64

Class Distribution after SMOTE:
Class
0    284315
1    284315
Name: count, dtype: int64
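
When the balanced data feeds a model, resampling is usually applied to the training split only, so that evaluation still reflects the real class ratio. The sketch below shows that ordering with random undersampling on toy data; the toy frame and split sizes are assumptions, and the same ordering would apply to SMOTE's `fit_resample`.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Toy imbalanced data (hypothetical): 950 legitimate rows, 50 fraud rows
rng = np.random.default_rng(42)
toy = pd.DataFrame({"Amount": rng.uniform(1, 500, size=1000),
                    "Class": [0] * 950 + [1] * 50})

# Split first, keeping the class ratio in both parts
train, test = train_test_split(toy, test_size=0.2, stratify=toy["Class"],
                               random_state=42)

# Undersample the majority class in the training part only
train_majority = train[train["Class"] == 0]
train_minority = train[train["Class"] == 1]
train_majority_down = resample(train_majority, replace=False,
                               n_samples=len(train_minority), random_state=42)
train_balanced = pd.concat([train_majority_down, train_minority])
```

The test split keeps its original 5% fraud rate, so metrics computed on it are not inflated by the rebalancing.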


# *Task 10: Combining Multiple Datasets – MovieLens Dataset*

**Introduction:** This task involves integrating two separate data sources (Movies and Ratings) from the MovieLens dataset into a unified structure for analysis.

**Technique:** Inner Merging (Joining) on the common key movieId.

**Reason:** Merging allows us to relate each rating to the title and genres of the rated movie, which is essential for building recommendation systems or performing targeted data analysis. User demographics would require an additional users table joined on userId, which is not loaded here.
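
An inner merge silently drops rows without a match, so it can be worth auditing the join. The sketch below uses the `validate` and `indicator` options of `pd.merge` on toy frames; the rows and titles are hypothetical stand-ins for the real files.

```python
import pandas as pd

# Toy stand-ins for the ratings and movies tables (hypothetical rows)
ratings = pd.DataFrame({"userId": [1, 1, 2, 2],
                        "movieId": [10, 20, 10, 999],
                        "rating": [3.5, 3.5, 4.0, 2.0]})
movies = pd.DataFrame({"movieId": [10, 20],
                       "title": ["Movie A", "Movie B"]})

# validate= raises a MergeError if movieId is not unique in `movies`
inner = pd.merge(ratings, movies, on="movieId", how="inner",
                 validate="many_to_one")

# A left join with indicator=True shows which ratings found no movie row
audit = pd.merge(ratings, movies, on="movieId", how="left", indicator=True)
unmatched = audit[audit["_merge"] == "left_only"]
```

Here the rating for movieId 999 disappears from the inner result and shows up in `unmatched`, which makes the loss visible.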

In [28]:
path_movies = '/kaggle/input/datasets/organizations/grouplens/movielens-20m-dataset/movie.csv'
path_ratings = '/kaggle/input/datasets/organizations/grouplens/movielens-20m-dataset/rating.csv'

# 1. Load the individual datasets
df_movies = pd.read_csv(path_movies)
df_ratings = pd.read_csv(path_ratings)

# 2. Merge Ratings with Movie Metadata
# Common key is 'movieId'
df_combined = pd.merge(df_ratings, df_movies, on='movieId', how='inner')

# Display Output
print("--- Movie Data Schema ---")
print(df_movies.columns)

print("\n--- Ratings Data Schema ---")
print(df_ratings.columns)

print("\n--- Combined Dataset (First 5 Rows) ---")
print(df_combined[['userId', 'movieId', 'rating', 'title', 'genres']].head())

print("\nShape of Combined Dataset:", df_combined.shape)

--- Movie Data Schema ---
Index(['movieId', 'title', 'genres'], dtype='object')

--- Ratings Data Schema ---
Index(['userId', 'movieId', 'rating', 'timestamp'], dtype='object')

--- Combined Dataset (First 5 Rows) ---
   userId  movieId  rating                                              title  \
0       1        2     3.5                                     Jumanji (1995)   
1       1       29     3.5  City of Lost Children, The (Cité des enfants p...   
2       1       32     3.5          Twelve Monkeys (a.k.a. 12 Monkeys) (1995)   
3       1       47     3.5                        Seven (a.k.a. Se7en) (1995)   
4       1       50     3.5                         Usual Suspects, The (1995)   

                                   genres  
0              Adventure|Children|Fantasy  
1  Adventure|Drama|Fantasy|Mystery|Sci-Fi  
2                 Mystery|Sci-Fi|Thriller  
3                        Mystery|Thriller  
4                  Crime|Mystery|Thriller  

Shape of Combined Dataset: (20

# *Task 11: Dimensionality Reduction – MNIST Dataset*

**Introduction:** This task focuses on reducing the dimensionality of the MNIST dataset (which has 784 pixel features per image) to a smaller set of features while retaining as much variance as possible.

**Technique:** Principal Component Analysis (PCA) and t-distributed Stochastic Neighbor Embedding (t-SNE).

**Reason:** Reducing dimensions speeds up machine learning training times, mitigates the "curse of dimensionality," and helps visualize high-dimensional data in 2D or 3D spaces.

In [29]:
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# Data loading
path = '/kaggle/input/datasets/oddrationale/mnist-in-csv/mnist_train.csv' 
df_mnist = pd.read_csv(path)

# Separate features (pixels) and target (label)
X = df_mnist.drop(columns=['label'])
y = df_mnist['label']

# 1. Principal Component Analysis (PCA)
# Reduce to 50 components to retain high variance
pca = PCA(n_components=50)
X_pca = pca.fit_transform(X)

print("--- PCA Results ---")
print(f"Original shape: {X.shape}")
print(f"Reduced shape: {X_pca.shape}")
print(f"Explained variance ratio (top 50 components): {np.sum(pca.explained_variance_ratio_):.4f}")

# 2. t-SNE (Basic Visualization)
# Reduce to 2 components for visualization (subsetting data for speed).
# Note: recent scikit-learn releases rename TSNE's n_iter to max_iter.
tsne = TSNE(n_components=2, random_state=42, n_iter=300)
X_tsne = tsne.fit_transform(X[:2000]) # Using first 2000 images for faster output

print("\n--- t-SNE Results ---")
print(f"Original shape: {X[:2000].shape}")
print(f"Reduced shape: {X_tsne.shape}")

--- PCA Results ---
Original shape: (60000, 784)
Reduced shape: (60000, 50)
Explained variance ratio (top 50 components): 0.8246





--- t-SNE Results ---
Original shape: (2000, 784)
Reduced shape: (2000, 2)
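
Instead of fixing the number of components in advance, PCA can be asked to keep whatever number reaches a target share of variance. The sketch below shows this on scikit-learn's bundled digits dataset (8x8 images), used here as a small stand-in for MNIST; the 0.95 target is an assumption.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

# load_digits ships with scikit-learn: 1797 images of 8x8 pixels (64 features)
X_digits, _ = load_digits(return_X_y=True)

# A float n_components asks PCA to keep enough components for that variance share
pca_95 = PCA(n_components=0.95, svd_solver="full")
X_reduced = pca_95.fit_transform(X_digits)

kept = pca_95.n_components_                        # number of components chosen
retained = pca_95.explained_variance_ratio_.sum()  # share of variance retained
```

This turns the variance target into the tunable parameter, which is often easier to justify than a hand-picked component count.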


# *Task 12: Text Preprocessing – IMDB Movie Reviews Dataset*


**Introduction:** This task involves cleaning and formatting unstructured text data (here, the plot summaries in the 'Overview' column of the IMDB Top 1000 file) so it can be converted into numerical vectors for machine learning models.

**Technique:** Lowercasing, special-character removal, tokenization, stopword removal, and stemming.

**Reason:** Unprocessed text contains noise (punctuation, capitalization) and very common words (a, the, is) that add little predictive value; cleaning reduces the feature space and can improve model accuracy.

In [30]:


# Data loading 
path = '/kaggle/input/datasets/harshitshankhdhar/imdb-dataset-of-top-1000-movies-and-tv-shows/imdb_top_1000.csv'
df_imdb = pd.read_csv(path)

df_imdb.head()

Unnamed: 0,Poster_Link,Series_Title,Released_Year,Certificate,Runtime,Genre,IMDB_Rating,Overview,Meta_score,Director,Star1,Star2,Star3,Star4,No_of_Votes,Gross
0,https://m.media-amazon.com/images/M/MV5BMDFkYT...,The Shawshank Redemption,1994,A,142 min,Drama,9.3,Two imprisoned men bond over a number of years...,80.0,Frank Darabont,Tim Robbins,Morgan Freeman,Bob Gunton,William Sadler,2343110,28341469
1,https://m.media-amazon.com/images/M/MV5BM2MyNj...,The Godfather,1972,A,175 min,"Crime, Drama",9.2,An organized crime dynasty's aging patriarch t...,100.0,Francis Ford Coppola,Marlon Brando,Al Pacino,James Caan,Diane Keaton,1620367,134966411
2,https://m.media-amazon.com/images/M/MV5BMTMxNT...,The Dark Knight,2008,UA,152 min,"Action, Crime, Drama",9.0,When the menace known as the Joker wreaks havo...,84.0,Christopher Nolan,Christian Bale,Heath Ledger,Aaron Eckhart,Michael Caine,2303232,534858444
3,https://m.media-amazon.com/images/M/MV5BMWMwMG...,The Godfather: Part II,1974,A,202 min,"Crime, Drama",9.0,The early life and career of Vito Corleone in ...,90.0,Francis Ford Coppola,Al Pacino,Robert De Niro,Robert Duvall,Diane Keaton,1129952,57300000
4,https://m.media-amazon.com/images/M/MV5BMWU4N2...,12 Angry Men,1957,U,96 min,"Crime, Drama",9.0,A jury holdout attempts to prevent a miscarria...,96.0,Sidney Lumet,Henry Fonda,Lee J. Cobb,Martin Balsam,John Fiedler,689845,4360000


In [31]:
import re
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer


# Download required NLTK resources
nltk.download('punkt')
nltk.download('stopwords')

# Initialize tools
stop_words = set(stopwords.words('english'))
stemmer = PorterStemmer()

[nltk_data] Downloading package punkt to /usr/share/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /usr/share/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [32]:
# Text Preprocessing Function

def preprocess_text(text):
    text = text.lower()                              # Lowercasing
    text = re.sub(r'[^a-zA-Z\s]', '', text)          # Remove special characters
    tokens = word_tokenize(text)                     # Tokenization
    tokens = [word for word in tokens if word not in stop_words]  # Stopword removal
    tokens = [stemmer.stem(word) for word in tokens] # Stemming
    return " ".join(tokens)

# Apply Preprocessing
df_imdb['cleaned_review'] = df_imdb['Overview'].apply(preprocess_text)

# Result Check
df_imdb[['Overview', 'cleaned_review']].head()

Unnamed: 0,Overview,cleaned_review
0,Two imprisoned men bond over a number of years...,two imprison men bond number year find solac e...
1,An organized crime dynasty's aging patriarch t...,organ crime dynasti age patriarch transfer con...
2,When the menace known as the Joker wreaks havo...,menac known joker wreak havoc chao peopl gotha...
3,The early life and career of Vito Corleone in ...,earli life career vito corleon new york citi p...
4,A jury holdout attempts to prevent a miscarria...,juri holdout attempt prevent miscarriag justic...


# *Task 13: Time-Series Preprocessing – Air Quality Dataset*

**Introduction:** This task involves cleaning and preparing time-series data, which has a specific temporal ordering, for analysis.

**Technique:** Building a datetime index, dropping missing readings, Resampling, and Smoothing (Rolling Mean).

**Reason:** Time-series data often has irregular intervals or gaps; resampling forces a regular frequency (e.g., daily), while smoothing reduces noise to highlight underlying trends.

In [33]:

path = '/kaggle/input/datasets/fedesoriano/air-quality-data-set/AirQuality.csv'
df_air = pd.read_csv(path, sep=';', decimal=',')

# Convert Date and Time columns to a single Datetime index
df_air['DateTime'] = pd.to_datetime(df_air['Date'] + ' ' + df_air['Time'], format='%d/%m/%Y %H.%M.%S', errors='coerce')
df_air.set_index('DateTime', inplace=True)

# Drop rows where the CO(GT) reading is missing
df_air.dropna(subset=['CO(GT)'], inplace=True)

print("Original Time Series (First 5 Rows):")
print(df_air[['CO(GT)']].head())

# 1. Resampling (Downsample to Daily Mean)
df_daily = df_air['CO(GT)'].resample('D').mean()
print("\n--- Daily Resampled Data (First 5 Rows) ---")
print(df_daily.head())

# 2. Smoothing (Rolling Mean - 7 Day Window)
df_smoothed = df_daily.rolling(window=7, min_periods=1).mean()
print("\n--- 7-Day Rolling Mean (First 5 Rows) ---")
print(df_smoothed.head())

Original Time Series (First 5 Rows):
                     CO(GT)
DateTime                   
2004-03-10 18:00:00     2.6
2004-03-10 19:00:00     2.0
2004-03-10 20:00:00     2.2
2004-03-10 21:00:00     2.2
2004-03-10 22:00:00     1.6

--- Daily Resampled Data (First 5 Rows) ---
DateTime
2004-03-10     1.966667
2004-03-11    -6.187500
2004-03-12   -14.095833
2004-03-13    -5.750000
2004-03-14    -5.966667
Freq: D, Name: CO(GT), dtype: float64

--- 7-Day Rolling Mean (First 5 Rows) ---
DateTime
2004-03-10    1.966667
2004-03-11   -2.110417
2004-03-12   -6.105556
2004-03-13   -6.016667
2004-03-14   -6.006667
Freq: D, Name: CO(GT), dtype: float64
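
The negative daily means above are not physically meaningful for a concentration. The UCI description of this dataset states that missing readings are tagged with the value -200; treat that as an assumption to verify against the file. Under that assumption, a minimal sketch of handling the sentinel before resampling looks like this (toy series, hypothetical values):

```python
import numpy as np
import pandas as pd

# Toy hourly series (hypothetical): every real reading is 2.0, two are sentinels
index = pd.Timestamp("2004-03-10") + pd.to_timedelta(np.arange(48), unit="h")
co = pd.Series(2.0, index=index, name="CO(GT)")
co.iloc[[5, 30]] = -200

raw_daily = co.resample("D").mean()          # dragged negative by the sentinel

co_clean = co.replace(-200, np.nan)          # treat the sentinel as missing
clean_daily = co_clean.resample("D").mean()  # mean() skips NaN by default
```

Replacing the sentinel with NaN first lets both the daily mean and the rolling mean ignore the gaps instead of averaging them in.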
