# Categorical data standardisation and normalisation

### One hot encoding
### Label Encoding

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [3]:
titanic_df = pd.read_csv("Datasets/Titanic.csv")
titanic_df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [4]:
titanic_df.columns

Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')

## Predict whether a passenger survived, based on the historical information
- The prediction itself will be covered in the machine learning part.
- For now, let's analyse the data.

In [7]:
titanic_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


## Types of categorical variables
- binary (exactly two levels)
- nominal (no order)
- ordinal (ordered levels)

- Here `Name` is nominal, `Sex` is binary (male/female), `Ticket` is nominal, and `Embarked` is nominal (port of embarkation: C = Cherbourg, Q = Queenstown, S = Southampton).

- When the levels have a hierarchy, with one ranked above another, the variable is ordinal.
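
As a rough illustration of the nominal versus ordinal distinction, pandas can mark a column as an ordered categorical. The `sizes` and `ports` values below are made up for the example and are not taken from the Titanic file.

```python
import pandas as pd

# Hypothetical ordinal variable: sizes have a natural order.
sizes = pd.Series(["small", "large", "medium", "small"])
ordered = pd.Categorical(sizes, categories=["small", "medium", "large"], ordered=True)

# Codes follow the declared order, not alphabetical order.
codes = pd.Series(ordered).cat.codes.tolist()
print(codes)  # [0, 2, 1, 0]

# A nominal variable has no meaningful order between its levels.
ports = pd.Series(["S", "C", "Q", "S"], dtype="category")
print(ports.cat.ordered)  # False
```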

### How many male and female passengers?

In [9]:
titanic_df['Sex'].value_counts()

Sex
male      577
female    314
Name: count, dtype: int64

### This indicates that there are more male passengers (577) than female passengers (314)

## How many survived?

In [10]:
titanic_df['Survived'].value_counts()

Survived
0    549
1    342
Name: count, dtype: int64

## How many male passengers survived and how many female?

In [13]:
titanic_df.groupby('Survived').agg({'Sex': 'value_counts'})

Unnamed: 0_level_0,Unnamed: 1_level_0,Sex
Survived,Sex,Unnamed: 2_level_1
0,male,468
0,female,81
1,female,233
1,male,109


More female passengers survived (233) than male passengers (109), even though there were more men than women on board.
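
The raw counts hide how different the survival rates are. A small sketch, rebuilding the four counts from the groupby output by hand (the `counts` frame and its column `n` are constructed here only for illustration):

```python
import pandas as pd

# Counts copied from the groupby output above.
counts = pd.DataFrame({
    "Survived": [0, 0, 1, 1],
    "Sex": ["male", "female", "female", "male"],
    "n": [468, 81, 233, 109],
})

# Rows: Sex, columns: Survived (0/1), values: passenger counts.
table = counts.pivot(index="Sex", columns="Survived", values="n")

# Survival rate = survivors / all passengers of that sex.
survival_rate = table[1] / table.sum(axis=1)
print(survival_rate.round(3))
```

On the full DataFrame, `titanic_df.groupby('Sex')['Survived'].mean()` should give the same rates directly, since `Survived` is coded as 0/1.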

# One-hot encoding

- Let's apply it to the `Sex` column.
- `pd.get_dummies()` generates one indicator column per level: `Sex_female` and `Sex_male`.
- With `drop_first=True`, the first level of each encoded column is dropped,
- so only `Sex_male` remains.

- In general, a column with n levels is encoded as (n - 1) indicator columns.
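
To see the effect of `drop_first` in isolation, here is a minimal sketch on a three-row toy frame (the `toy` data is invented; only `pd.get_dummies` and its `columns` and `drop_first` parameters come from the notes):

```python
import pandas as pd

toy = pd.DataFrame({
    "Sex": ["male", "female", "female"],
    "Embarked": ["S", "C", "Q"],
})

# One indicator column per level.
full = pd.get_dummies(toy, columns=["Sex", "Embarked"])

# drop_first=True removes the first level of each encoded column.
reduced = pd.get_dummies(toy, columns=["Sex", "Embarked"], drop_first=True)

print(list(full.columns))
# ['Sex_female', 'Sex_male', 'Embarked_C', 'Embarked_Q', 'Embarked_S']
print(list(reduced.columns))
# ['Sex_male', 'Embarked_Q', 'Embarked_S']
```

The dropped level is not lost: a row with `Embarked_Q` and `Embarked_S` both false must be `C`.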

In [18]:
pd.get_dummies(data=titanic_df, columns=['Sex', 'Embarked'], drop_first=True)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Age,SibSp,Parch,Ticket,Fare,Cabin,Sex_male,Embarked_Q,Embarked_S
0,1,0,3,"Braund, Mr. Owen Harris",22.0,1,0,A/5 21171,7.2500,,True,False,True
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",38.0,1,0,PC 17599,71.2833,C85,False,False,False
2,3,1,3,"Heikkinen, Miss. Laina",26.0,0,0,STON/O2. 3101282,7.9250,,False,False,True
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",35.0,1,0,113803,53.1000,C123,False,False,True
4,5,0,3,"Allen, Mr. William Henry",35.0,0,0,373450,8.0500,,True,False,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",27.0,0,0,211536,13.0000,,True,False,True
887,888,1,1,"Graham, Miss. Margaret Edith",19.0,0,0,112053,30.0000,B42,False,False,True
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",,1,2,W./C. 6607,23.4500,,False,False,True
889,890,1,1,"Behr, Mr. Karl Howell",26.0,0,0,111369,30.0000,C148,True,False,False


## Label Encoding

In [20]:
from sklearn.preprocessing import LabelEncoder

In [21]:
le_encoder = LabelEncoder()

In [24]:
titanic_df['Encoded_Gender'] = le_encoder.fit_transform(titanic_df['Sex'])
titanic_df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Encoded_Gender
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S,1
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,0
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S,0
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S,0
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S,1


In [25]:
titanic_df['Encoded_Embark'] = le_encoder.fit_transform(titanic_df['Embarked'])

In [26]:
titanic_df['Encoded_Embark'].value_counts()

Encoded_Embark
2    644
0    168
1     77
3      2
Name: count, dtype: int64

In [30]:
titanic_df['Embarked'].unique()

array(['S', 'C', 'Q', nan], dtype=object)

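`Embarked` contains two NaN values, and `LabelEncoder` has encoded them as a separate class (3). These missing values should be dropped or filled before encoding.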
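
One possible way to deal with the missing values first is sketched below. Filling with the most frequent value is an assumption made for this example, not the only reasonable choice, and `embarked` is a small stand-in series rather than the real column. The sketch also uses a dedicated encoder per column, so that `classes_` still describes the column it was fitted on.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Small stand-in for the Embarked column, including one missing value.
embarked = pd.Series(["S", "C", "Q", None, "S"])

# Assumed strategy: fill missing values with the most frequent port.
filled = embarked.fillna(embarked.mode()[0])

# A separate encoder for this column keeps its class mapping available.
embark_encoder = LabelEncoder()
encoded = embark_encoder.fit_transform(filled)

print(list(embark_encoder.classes_))  # ['C', 'Q', 'S']
print(encoded.tolist())               # [2, 0, 1, 2, 2]

# The mapping can be reversed because the encoder was not refitted elsewhere.
print(embark_encoder.inverse_transform([0, 1, 2]).tolist())  # ['C', 'Q', 'S']
```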

# Steps of handling data

## Input to model
1. Define the problem statement 
    - identify what you will need to achieve
2. Collect the data 
    - sampling
    - surveying
3. Preparing the data
    - reading the data
        - .read_csv()
    - Checking the data dimensions
        - .shape
    - check the data types
        - .info()
    - check the records
        - .head()
        - .columns
    - checking the summary
        - .describe()
    - data visualisation
        - distribution
            - univariate
                - histogram
                - boxplot
            - bivariate
                - line
                - scatter
            - multivariate analysis
                - pie
                - heatmap
    - data preprocessing
        - handling missing data points
            - isnull().sum()
            - fillna()
        - handling outliers
            - outlier treatment
        - removing duplicates
            - duplicated()
    - data standardisation and normalisation
        - numeric variable
            - min max scaler
            - standard scaler
            - binning (categorisation of numeric into categorical)
        - categorical
            - one hot encoding
            - label encoding
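
The preparation steps in item 3 can be sketched end to end on a tiny made-up frame. Everything below is illustrative: the data is invented, the scaling is written out with plain pandas arithmetic rather than scikit-learn's scaler classes, and the age bins and labels are arbitrary.

```python
import pandas as pd

# Made-up data with one missing age and one duplicated row.
df = pd.DataFrame({
    "Age": [22.0, 38.0, None, 35.0, 35.0],
    "Fare": [7.25, 71.28, 7.93, 512.33, 512.33],
})

print(df.shape)           # data dimensions
print(df.isnull().sum())  # missing values per column

# Missing data: fill Age with the median.
df["Age"] = df["Age"].fillna(df["Age"].median())

# Duplicates: keep the first copy of each repeated row.
df = df.drop_duplicates().reset_index(drop=True)

# Min-max scaling maps values to the range [0, 1].
fare_range = df["Fare"].max() - df["Fare"].min()
df["Fare_minmax"] = (df["Fare"] - df["Fare"].min()) / fare_range

# Standard scaling gives mean 0 and unit standard deviation (ddof=0).
df["Fare_std"] = (df["Fare"] - df["Fare"].mean()) / df["Fare"].std(ddof=0)

# Binning turns a numeric column into categories (bins are arbitrary here).
df["Age_group"] = pd.cut(
    df["Age"], bins=[0, 18, 35, 100], labels=["child", "young_adult", "adult"]
)
print(df)
```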

### Build the model (the output of step 3 is the input to this step)
4. Build the model
        


### How to handle a date variable?

In [31]:
sample_data = {
    'Date': ['2025-03-22', '2025-03-23', '2025-03-24']
}

In [36]:
sample_df = pd.DataFrame(sample_data)
sample_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 1 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Date    3 non-null      object
dtypes: object(1)
memory usage: 156.0+ bytes


Here `Date` is currently stored as a string (dtype `object`); let's convert it to a datetime.

In [38]:
sample_df['Converted_date'] = pd.to_datetime(sample_df['Date'])
sample_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 2 columns):
 #   Column          Non-Null Count  Dtype         
---  ------          --------------  -----         
 0   Date            3 non-null      object        
 1   Converted_date  3 non-null      datetime64[ns]
dtypes: datetime64[ns](1), object(1)
memory usage: 180.0+ bytes


In [39]:
sample_df

Unnamed: 0,Date,Converted_date
0,2025-03-22,2025-03-22
1,2025-03-23,2025-03-23
2,2025-03-24,2025-03-24


## If I want to convert dates given in a different format

In [46]:
sample_data2 = {
    'Date': ['22-03-2025', '23-03-2025', '24-03-2025']
}
sample_df2 = pd.DataFrame(sample_data2)
sample_df2["Converted_date"] = pd.to_datetime(sample_df2["Date"], format='%d-%m-%Y')

In [47]:
sample_df2

Unnamed: 0,Date,Converted_date
0,22-03-2025,2025-03-22
1,23-03-2025,2025-03-23
2,24-03-2025,2025-03-24


In [50]:
sample_df2["Month"] = sample_df2['Converted_date'].dt.month
sample_df2["Day"] = sample_df2['Converted_date'].dt.day
sample_df2

Unnamed: 0,Date,Converted_date,Month,Day
0,22-03-2025,2025-03-22,3,22
1,23-03-2025,2025-03-23,3,23
2,24-03-2025,2025-03-24,3,24
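
A few more things that are often useful once a column is a datetime, sketched on an invented series `raw` that deliberately contains one bad value. The choice of `errors="coerce"` is an assumption about how bad values should be handled.

```python
import pandas as pd

raw = pd.Series(["22-03-2025", "23-03-2025", "not a date"])

# errors="coerce" turns unparseable strings into NaT instead of raising.
dates = pd.to_datetime(raw, format="%d-%m-%Y", errors="coerce")

features = pd.DataFrame({
    "year": dates.dt.year,
    "weekday": dates.dt.day_name(),
    "as_text": dates.dt.strftime("%Y/%m/%d"),  # back to a string, new format
})
print(features)
```

Rows that could not be parsed show up as missing values, so they can be handled with the same `isnull()` and `fillna()` tools as any other missing data.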


# Exploratory data analysis

1. Data preprocessing
    - missing
    - outlier
    - duplicates
2. data wrangling
    - standardisation and normalisation
3. data visualisation
    - univariate
    - bivariate
    - multivariate

# Statistical analysis

Statistics is the study of data.
### Types of statistics
- Descriptive
    - measures of central tendency (mean, median, mode)
    - skewness
- Inferential
    - infer something about a population from a sample of data
    - hypothesis testing
        - uses statistical tests
        - a hypothesis is a statement that we claim to be true
        - e.g. my hypothesis is that the average of the data is 5000; this holds for my sample, and now we need to verify it for the entire population
        - the null hypothesis here is that the average is 5000
        - the alternative hypothesis is that the average is not 5000
        - a statistical test then tells us whether the data give enough evidence to reject the null hypothesis (this is how we convince people that what we say about the data is true)

        - "it will rain tomorrow" (not everyone accepts this) -> hypothesis
        - alternative: it will not rain
        - convince people with data
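
To make the "average is 5000" example concrete, here is a minimal sketch of a one-sample t statistic computed by hand. The `sample` values are invented, and the sketch stops at the test statistic: turning it into a p-value needs the t distribution, which a library routine such as `scipy.stats.ttest_1samp` would normally handle.

```python
import math
import statistics

# Hypothetical sample; the values are invented for illustration.
sample = [5100, 4900, 5200, 5050, 4950, 5150, 5000, 5250]
mu0 = 5000  # mean claimed by the null hypothesis

n = len(sample)
sample_mean = statistics.mean(sample)
sample_sd = statistics.stdev(sample)  # sample standard deviation (n - 1)

# t = (sample mean - hypothesised mean) / standard error of the mean
t_stat = (sample_mean - mu0) / (sample_sd / math.sqrt(n))
print(sample_mean, round(t_stat, 2))

# A t statistic far from 0 is evidence against the null hypothesis.
```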
---------------
- probability
    - likelihood of an event
    - e.g. the probability of India winning a World Cup

- linear algebra
    - to study the relationship of variables
    - how is x related to y
    - system of linear equations

    - x - time spent learning
    - y - time spent practising
    - z - time spent resting

    - 2x + 3y + 4z = 5
    - 3x + 5y + 7z = 9
    - 9x + 3y + 2z = 11

    - find the optimum time for learning, practising and rest to get the marks
    - matrix operations come here
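
The three equations above can be written as a matrix equation A·v = b and solved with NumPy. This is only a sketch of the matrix view; the names `A`, `b` and `solution` are chosen here. The solution comes out as x = 2, y = -5, z = 4, and the negative y is a reminder that the coefficients in the notes are illustrative rather than a realistic time budget.

```python
import numpy as np

# Coefficients of x, y, z in the three equations.
A = np.array([
    [2.0, 3.0, 4.0],
    [3.0, 5.0, 7.0],
    [9.0, 3.0, 2.0],
])
# Right-hand sides of the equations.
b = np.array([5.0, 9.0, 11.0])

solution = np.linalg.solve(A, b)
x, y, z = solution
print(x, y, z)

# Check: substituting the solution back should reproduce b.
print(np.allclose(A @ solution, b))
```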


probability distribution
- a function that gives the probability of each possible outcome of a trial or experiment
- a distribution described by these probability values is called a probability distribution

types of probability distribution
- dice
    - bar chart
    - all faces have an equal chance of occurring
    - uniform distribution
- bernoulli distribution
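
The uniform shape of the dice example can be checked with a quick simulation. This is a sketch with an arbitrary seed and sample size; the relative frequencies should all land close to 1/6.

```python
import numpy as np

rng = np.random.default_rng(seed=0)       # fixed seed so the run is repeatable
rolls = rng.integers(1, 7, size=60_000)   # upper bound is exclusive: faces 1..6

# Count how often each face appeared.
faces, counts = np.unique(rolls, return_counts=True)
relative_freq = counts / counts.sum()

for face, freq in zip(faces, relative_freq):
    print(face, round(float(freq), 3))    # each should be close to 1/6
```

Passing `faces` and `relative_freq` to `plt.bar` would give the bar chart mentioned above, with six bars of roughly equal height.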

types of numeric variables
- discrete
    - countable
    - fixed set of values
    - whole numbers
    - eg: rolling a die, flipping a coin
    - the probability distributions differ for discrete and continuous variables
    - types
        - Bernoulli
            - when you expect exactly 2 outcomes in a scenario
            - e.g. the probability of getting a head when 1 coin is tossed once
            - there is only one trial
        - Poisson
            - the probability of an event happening a given number of times over a period of time
            - e.g. what is the probability that you will get 10 emails in the next hour?
                - time interval = 1 hour
                - we need the probability of a given event count
                - probability over time
        - binomial
            - repeated trials of the same experiment
            - e.g. the probability that I get 3 heads in 5 tosses
            - we are not tossing 5 separate coins (the same coin is tossed 5 times)
- continuous
    - any value in a range
    - height, weight, temperature
    - types
        - normal or Gaussian distribution (bell-shaped curve)
            - e.g. the probability of a person's height falling in a particular range
        - exponential
            - time until your next bus arrives
        - uniform
            - every outcome is equally likely
            - e.g. a random number drawn from a range
            - the probability of getting 5 on a fair die is 1/6 ≈ 0.167 (the die itself is discrete, so this is the discrete uniform case)
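
The email and bus examples need a rate before any number can be computed, and the notes do not give one. The sketch below therefore assumes an average of 6 emails per hour and one bus every 10 minutes; both rates, and the helper names `poisson_pmf` and `exponential_cdf`, are made up for illustration.

```python
import math


def poisson_pmf(k: int, lam: float) -> float:
    """P(X = k) for a Poisson variable with average rate lam per interval."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


def exponential_cdf(t: float, rate: float) -> float:
    """P(T <= t) for an exponential waiting time with the given rate."""
    return 1.0 - math.exp(-rate * t)


# Assumed: 6 emails arrive per hour on average.
p_ten_emails = poisson_pmf(10, lam=6.0)

# Assumed: a bus arrives every 10 minutes on average (rate = 0.1 per minute).
p_bus_within_5_min = exponential_cdf(5.0, rate=0.1)

# Fair six-sided die: every face has probability 1/6.
p_five_on_die = 1 / 6

print(round(p_ten_emails, 4), round(p_bus_within_5_min, 4), round(p_five_on_die, 4))
```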

- A machine has a 70% chance of working on any given day.
What is the probability of the machine working on a given day?
This is a Bernoulli distribution with p = 0.7, so the answer is 0.7.
Bernoulli distribution formula:
$$P(X = x) = p^{x}(1-p)^{1-x}, \quad x \in \{0, 1\}$$
For x = 1 (the machine works):
$$P(X = 1) = (0.7)^{1}(1 - 0.7)^{1-1} = 0.7 \times 1 = 0.7$$
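
The same Bernoulli calculation as a small function, as a sketch (the name `bernoulli_pmf` is chosen here for illustration):

```python
def bernoulli_pmf(x: int, p: float) -> float:
    """P(X = x) for a single trial with success probability p."""
    if x not in (0, 1):
        raise ValueError("a Bernoulli outcome must be 0 or 1")
    return p ** x * (1 - p) ** (1 - x)


p_working = bernoulli_pmf(1, 0.7)      # machine works
p_not_working = bernoulli_pmf(0, 0.7)  # machine does not work

print(p_working, round(p_not_working, 2))
```

The two probabilities add up to 1, as they must for a variable with only two outcomes.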


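- Binomial distribution
Formula for the probability of getting exactly k successes in n trials:
$$P(X = k) = \binom{n}{k} p^{k} (1-p)^{n-k}, \qquad \binom{n}{k} = \frac{n!}{k!\,(n-k)!}$$

eg: probability of getting 3 heads in 5 trials

n = 5, k = 3, p = 0.5

$$P(X = 3) = \frac{5!}{3!\,(5-3)!} (0.5)^{3} (0.5)^{2} = \frac{5!}{3!\,2!} \times 0.125 \times 0.25 = 10 \times 0.03125 = 0.3125$$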
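
The 3-heads-in-5-tosses example can be checked numerically with the standard library. This is a sketch; `binomial_pmf` is a name chosen here, and `math.comb` supplies the "n choose k" term.

```python
import math


def binomial_pmf(k: int, n: int, p: float) -> float:
    """P(X = k): probability of exactly k successes in n independent trials."""
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)


# 3 heads in 5 tosses of a fair coin.
p_three_heads = binomial_pmf(3, 5, 0.5)
print(p_three_heads)  # 0.3125

# Sanity check: probabilities over k = 0..5 sum to 1.
total = sum(binomial_pmf(k, 5, 0.5) for k in range(6))
print(total)
```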
