# Duplicates
* sklearn
* Scaling

### Duplicates
* A duplicate is a row that repeats the values of an earlier row, either across all columns or across a chosen subset of columns

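### Effect
* Skews summary statistics such as the mean and median
* Distorts statistical analysis, because repeated rows are counted more than once
* Introduces bias toward the over-represented rows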
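
The effect on summary statistics can be seen on a tiny example. This is a minimal sketch on made-up data (the `scores` frame and its column names are hypothetical, not part of the dataset used later):

```python
import pandas as pd

# Hypothetical toy data: student "c" was accidentally recorded three times
scores = pd.DataFrame({
    "student": ["a", "b", "c", "c", "c"],
    "score": [50, 60, 100, 100, 100],
})

# With the repeated rows, the score 100 is counted three times
mean_with_dupes = scores["score"].mean()
median_with_dupes = scores["score"].median()

# After removing the repeated rows, each student counts once
deduped_scores = scores.drop_duplicates()
mean_deduped = deduped_scores["score"].mean()
median_deduped = deduped_scores["score"].median()

print(mean_with_dupes, mean_deduped)      # 82.0 70.0
print(median_with_dupes, median_deduped)  # 100.0 60.0
```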

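## Identify duplicates
* df.duplicated().any() -> a single Boolean, True if at least one row is a duplicate
* df.duplicated().sum() -> the number of duplicate rows
* df.duplicated(subset='CN') -> a Boolean mask based on one column only ('CN' stands for a column name); pass the mask to df.loc to select those rows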

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns

In [36]:
df = pd.DataFrame({"A": [1, 2, 6, 1, 4, 4], "B": [1, 6, 6, 1, 4, 4]})
print(df)

   A  B
0  1  1
1  2  6
2  6  6
3  1  1
4  4  4
5  4  4


In [29]:
df.duplicated()

0    False
1    False
2    False
3     True
4    False
5     True
dtype: bool

In [30]:
df.duplicated().sum()

np.int64(2)

In [31]:
df.loc[df.duplicated(subset="A"),"A"]

3    1
5    4
Name: A, dtype: int64

In [32]:
df.loc[df.duplicated(subset="B"),"B"]

2    6
3    1
5    4
Name: B, dtype: int64
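
By default `duplicated()` flags only the later copies, so the first occurrence of each repeated row stays unmarked. To inspect every member of a duplicate group, `keep=False` can be passed. A small sketch that rebuilds the same toy frame under the hypothetical name `df_demo`, so it runs on its own:

```python
import pandas as pd

df_demo = pd.DataFrame({"A": [1, 2, 6, 1, 4, 4], "B": [1, 6, 6, 1, 4, 4]})

# keep=False marks all copies of a repeated row, including the first one
all_copies = df_demo[df_demo.duplicated(keep=False)]
print(all_copies)
```

Here rows 0 and 3 form one group and rows 4 and 5 form another, so all four are returned.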

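# Handling/removing duplicates

### Handling duplicates
* The usual fix is to drop the repeated rows
* df.drop_duplicates() keeps the first occurrence by default; keep="last" keeps the last occurrence instead, and subset= restricts the comparison to the given columns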

In [27]:
df.drop_duplicates()

   A  B
0  1  1
1  2  6
2  6  6
4  4  4


In [35]:
df.drop_duplicates(keep="last")

   A  B
1  2  6
2  6  6
3  1  1
5  4  4


In [39]:
df.drop_duplicates(subset="A", keep="last")

   A  B
1  2  6
2  6  6
3  1  1
5  4  4
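
Note that `drop_duplicates()` returns a new DataFrame and leaves the original untouched, so the calls above only display the result. A minimal sketch of keeping the result, again on a stand-alone copy of the toy frame (`df_demo` and `deduped` are hypothetical names):

```python
import pandas as pd

df_demo = pd.DataFrame({"A": [1, 2, 6, 1, 4, 4], "B": [1, 6, 6, 1, 4, 4]})

# Assign the result; ignore_index=True renumbers the rows 0..n-1
deduped = df_demo.drop_duplicates(ignore_index=True)

print(len(df_demo), len(deduped))  # 6 4
print(deduped)
```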


In [40]:
df=pd.read_csv('social_media.csv')
df

Unnamed: 0,age,gender,job_type,daily_social_media_time,social_platform_preference,number_of_notifications,work_hours_per_day,perceived_productivity_score,actual_productivity_score,stress_level,sleep_hours,screen_time_before_sleep,breaks_during_work,uses_focus_apps,has_digital_wellbeing_enabled,coffee_consumption_per_day,days_feeling_burnout_per_month,weekly_offline_hours,job_satisfaction_score
0,56,Male,Unemployed,4.180940,Facebook,61,6.753558,8.040464,7.291555,4.0,5.116546,0.419102,8,False,False,4,11,21.927072,6.336688
1,46,Male,Health,3.249603,Twitter,59,9.169296,5.063368,5.165093,7.0,5.103897,0.671519,7,True,True,2,25,0.000000,3.412427
2,32,Male,Finance,,Twitter,57,7.910952,3.861762,3.474053,4.0,8.583222,0.624378,0,True,False,3,17,10.322044,2.474944
3,60,Female,Unemployed,,Facebook,59,6.355027,2.916331,1.774869,6.0,6.052984,1.204540,1,False,False,0,4,23.876616,1.733670
4,25,Male,IT,,Telegram,66,6.214096,8.868753,,7.0,5.405706,1.876254,1,False,True,1,30,10.653519,9.693060
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
29995,34,Female,Health,1.877297,Facebook,59,10.226358,3.348512,3.465815,8.0,5.480462,1.412655,9,False,False,4,5,21.776927,
29996,39,Male,Health,4.437784,Instagram,46,4.692862,8.133213,6.659294,8.0,3.045393,0.148936,3,False,False,1,29,4.111370,6.155613
29997,42,Male,Education,17.724981,TikTok,64,10.915036,8.611005,8.658912,5.0,5.491520,1.224296,10,False,False,1,2,1.888315,6.285237
29998,20,Female,Education,3.796634,Instagram,56,6.937410,7.767076,6.895583,8.0,6.816069,0.234483,1,False,False,2,9,12.511871,7.854711


In [42]:
df.duplicated().sum()

np.int64(0)

## Scikit-Learn (sklearn) 
### Introduction
* Scikit-learn (sklearn) is a popular Python library for machine learning. It provides simple and efficient tools for data analysis and modeling, is built on top of NumPy and SciPy, and works directly with pandas DataFrames.

### Why Use Scikit-Learn

* Easy to use API (fit, transform, predict)
* Supports most classical ML algorithms
* Excellent documentation
* Seamless integration with pipelines and preprocessing
* Widely used in industry and academia
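
The fit / transform / predict pattern mentioned above can be sketched on a few hand-made numbers. The data below is hypothetical (y = 2x) and chosen only so the result is easy to check:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data following y = 2x
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([2.0, 4.0, 6.0, 8.0])

# Transformers: fit() learns the statistics, transform() applies them
scaler = StandardScaler()
scaler.fit(X)
X_scaled = scaler.transform(X)

# Estimators: fit() trains the model, predict() uses it
model = LinearRegression()
model.fit(X_scaled, y)

# New data must go through the same fitted scaler before predict()
pred = model.predict(scaler.transform(np.array([[5.0]])))
print(pred)  # approximately [10.]
```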

### Core Modules

* Preprocessing: Scaling, encoding, missing values
* Model Selection: train_test_split, cross-validation, GridSearchCV
* Models: Linear models, SVM, KNN, Trees, Ensembles
* Metrics: Accuracy, precision, recall, R², RMSE
* Pipelines: Combine preprocessing and models

### Typical Machine Learning Workflow

* Load data
* Split into train and test sets
* Preprocess features
* Train model
* Make predictions
* Evaluate performance
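
The six steps above could look roughly like this. It is a sketch, not the only way to do it: the iris dataset, logistic regression, and the 80/20 split are assumptions picked for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# 1. Load data (iris ships with scikit-learn, nothing is downloaded)
X, y = load_iris(return_X_y=True)

# 2. Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# 3. Preprocess: fit the scaler on the training set only, then reuse it
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# 4. Train the model
model = LogisticRegression(max_iter=1000)
model.fit(X_train_scaled, y_train)

# 5. Make predictions
y_pred = model.predict(X_test_scaled)

# 6. Evaluate performance
acc = accuracy_score(y_test, y_pred)
print(f"test accuracy: {acc:.3f}")
```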

### Pipelines
* Pipelines allow chaining preprocessing steps and models into a single object. This prevents data leakage and improves reproducibility.
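
As a rough illustration, a scaler and a classifier can be chained and cross-validated as one object. The step names `"scaler"` and `"knn"`, the KNN model, and the iris data are choices made for this sketch:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Each step is a (name, estimator) pair; the last step is the model
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("knn", KNeighborsClassifier(n_neighbors=5)),
])

# In every fold the scaler is re-fit on that fold's training part only,
# so no statistics from the validation part leak into preprocessing
scores = cross_val_score(pipe, X, y, cv=5)
print(scores.mean())
```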

### Hyperparameter Tuning
* Scikit-learn provides GridSearchCV and RandomizedSearchCV to automatically search for the best model parameters using cross-validation.
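
A minimal sketch of a grid search over one parameter. The candidate values for `n_neighbors`, the KNN model, and the iris data are assumptions for illustration; parameters of a pipeline step are addressed as `<step name>__<parameter>`.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("knn", KNeighborsClassifier()),
])

# Try each candidate value with 5-fold cross-validation
param_grid = {"knn__n_neighbors": [1, 3, 5, 7, 9]}
search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X, y)

print(search.best_params_)
print(search.best_score_)
```

RandomizedSearchCV follows the same pattern but samples a fixed number of parameter combinations instead of trying all of them.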

### Advantages and Limitations
* Advantages: Easy, fast, reliable, wide algorithm support
* Limitations: Not suitable for deep learning or very large-scale GPU workloads

### sklearn vs Deep Learning Libraries
* Scikit-learn is ideal for classical ML problems and structured data, while TensorFlow and PyTorch are used for deep learning and unstructured data such as images, audio, and text.

## Feature Scaling
* Features are the columns of the dataset (the model inputs)
* Scaling brings all numeric features into a comparable range
* load the dataset -> EDA -> preprocessing -> separate numerical and categorical columns -> feature scaling -> train-test
* Scaling matters most for numerical columns with high magnitude

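## Three feature scaling techniques
* StandardScaler -> standardization (z-score): (x - mean) / standard deviation
* MinMaxScaler -> normalization: (x - min) / (max - min), giving values in [0, 1]
* RobustScaler -> (x - median) / IQR, which makes it less sensitive to outliers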
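
The three formulas can be checked by hand against scikit-learn on a small made-up column. This is a sketch with default scaler settings; the array `x` is hypothetical and includes one outlier (100) to show how differently the scalers react.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

# Hypothetical single feature with one outlier
x = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

# Standardization: StandardScaler uses the population std (ddof=0)
z_manual = (x - x.mean()) / x.std()

# Min-max normalization
mm_manual = (x - x.min()) / (x.max() - x.min())

# Robust scaling: center on the median, divide by the IQR (Q3 - Q1)
q1, median, q3 = np.percentile(x, [25, 50, 75])
rb_manual = (x - median) / (q3 - q1)

z_sklearn = StandardScaler().fit_transform(x)
mm_sklearn = MinMaxScaler().fit_transform(x)
rb_sklearn = RobustScaler().fit_transform(x)

print(np.allclose(z_manual, z_sklearn))    # True
print(np.allclose(mm_manual, mm_sklearn))  # True
print(np.allclose(rb_manual, rb_sklearn))  # True
print(rb_sklearn.ravel())  # [-1.  -0.5  0.   0.5 48.5]
```

With min-max scaling the outlier squeezes the other four values into a narrow band near 0, while robust scaling keeps them evenly spread around 0.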

![image.png](attachment:e8e6f02f-7dae-4e02-9722-d82ffd84c66b.png)

In [46]:
df.head(2)

Unnamed: 0,age,gender,job_type,daily_social_media_time,social_platform_preference,number_of_notifications,work_hours_per_day,perceived_productivity_score,actual_productivity_score,stress_level,sleep_hours,screen_time_before_sleep,breaks_during_work,uses_focus_apps,has_digital_wellbeing_enabled,coffee_consumption_per_day,days_feeling_burnout_per_month,weekly_offline_hours,job_satisfaction_score
0,56,Male,Unemployed,4.18094,Facebook,61,6.753558,8.040464,7.291555,4.0,5.116546,0.419102,8,False,False,4,11,21.927072,6.336688
1,46,Male,Health,3.249603,Twitter,59,9.169296,5.063368,5.165093,7.0,5.103897,0.671519,7,True,True,2,25,0.0,3.412427


In [50]:
from sklearn.preprocessing import StandardScaler

scal = StandardScaler()  # initialization
# Double brackets keep 'age' as a 2-D DataFrame, which the scaler expects
df[['age']] = scal.fit_transform(df[['age']])

In [51]:
df.head(2)

Unnamed: 0,age,gender,job_type,daily_social_media_time,social_platform_preference,number_of_notifications,work_hours_per_day,perceived_productivity_score,actual_productivity_score,stress_level,sleep_hours,screen_time_before_sleep,breaks_during_work,uses_focus_apps,has_digital_wellbeing_enabled,coffee_consumption_per_day,days_feeling_burnout_per_month,weekly_offline_hours,job_satisfaction_score
0,1.049017,Male,Unemployed,4.18094,Facebook,61,6.753558,8.040464,7.291555,4.0,5.116546,0.419102,8,False,False,4,11,21.927072,6.336688
1,0.326212,Male,Health,3.249603,Twitter,59,9.169296,5.063368,5.165093,7.0,5.103897,0.671519,7,True,True,2,25,0.0,3.412427


In [53]:
from sklearn.preprocessing import MinMaxScaler

scal = MinMaxScaler()  # initialization
df[['daily_social_media_time']] = scal.fit_transform(df[['daily_social_media_time']])

In [54]:
df.head(2)

Unnamed: 0,age,gender,job_type,daily_social_media_time,social_platform_preference,number_of_notifications,work_hours_per_day,perceived_productivity_score,actual_productivity_score,stress_level,sleep_hours,screen_time_before_sleep,breaks_during_work,uses_focus_apps,has_digital_wellbeing_enabled,coffee_consumption_per_day,days_feeling_burnout_per_month,weekly_offline_hours,job_satisfaction_score
0,1.049017,Male,Unemployed,0.23262,Facebook,61,6.753558,8.040464,7.291555,4.0,5.116546,0.419102,8,False,False,4,11,21.927072,6.336688
1,0.326212,Male,Health,0.180802,Twitter,59,9.169296,5.063368,5.165093,7.0,5.103897,0.671519,7,True,True,2,25,0.0,3.412427
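
The `daily_social_media_time` column contains missing values, and the scaler still runs: scikit-learn's scalers ignore NaN when fitting and leave it as NaN when transforming. A small sketch on a hypothetical frame (`demo` and its `hours` column are made up for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical column with one missing value
demo = pd.DataFrame({"hours": [1.0, np.nan, 3.0, 5.0]})

# min and max are computed from the non-missing values (1.0 and 5.0)
demo[["hours"]] = MinMaxScaler().fit_transform(demo[["hours"]])

print(demo)
```

The missing entry stays NaN, so scaling does not replace imputation; missing values still have to be handled separately.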


In [56]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30000 entries, 0 to 29999
Data columns (total 19 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   age                             30000 non-null  float64
 1   gender                          30000 non-null  object 
 2   job_type                        30000 non-null  object 
 3   daily_social_media_time         27235 non-null  float64
 4   social_platform_preference      30000 non-null  object 
 5   number_of_notifications         30000 non-null  int64  
 6   work_hours_per_day              30000 non-null  float64
 7   perceived_productivity_score    28386 non-null  float64
 8   actual_productivity_score       27635 non-null  float64
 9   stress_level                    28096 non-null  float64
 10  sleep_hours                     27402 non-null  float64
 11  screen_time_before_sleep        27789 non-null  float64
 12  breaks_during_work              

In [57]:
# Names of all integer and float columns (object and bool columns are excluded)
num_cols = df.select_dtypes(include=['int64', 'float64']).columns

In [58]:
print(num_cols)

Index(['age', 'daily_social_media_time', 'number_of_notifications',
       'work_hours_per_day', 'perceived_productivity_score',
       'actual_productivity_score', 'stress_level', 'sleep_hours',
       'screen_time_before_sleep', 'breaks_during_work',
       'coffee_consumption_per_day', 'days_feeling_burnout_per_month',
       'weekly_offline_hours', 'job_satisfaction_score'],
      dtype='object')


In [60]:
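from sklearn.preprocessing import MinMaxScaler

scal = MinMaxScaler()
# Scale every numeric column at once; 'age' and 'daily_social_media_time'
# were already scaled above and are rescaled to [0, 1] here
df[num_cols] = scal.fit_transform(df[num_cols])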

In [61]:
df.head(2)

Unnamed: 0,age,gender,job_type,daily_social_media_time,social_platform_preference,number_of_notifications,work_hours_per_day,perceived_productivity_score,actual_productivity_score,stress_level,sleep_hours,screen_time_before_sleep,breaks_during_work,uses_focus_apps,has_digital_wellbeing_enabled,coffee_consumption_per_day,days_feeling_burnout_per_month,weekly_offline_hours,job_satisfaction_score
0,0.808511,Male,Unemployed,0.23262,Facebook,0.516667,0.562797,0.862995,0.732476,0.333333,0.302364,0.139701,0.8,False,False,0.4,0.354839,0.535267,0.633669
1,0.595745,Male,Health,0.180802,Twitter,0.483333,0.764108,0.437643,0.509797,0.666667,0.300557,0.22384,0.7,True,True,0.2,0.806452,0.0,0.341243
