## Import the Libraries

In [1]:
import pandas as pd

`Pandas has 2 core objects:`

`Series → 1D (like a single column)`

`DataFrame → 2D table (rows + columns) ← MOST IMPORTANT`

Think of a DataFrame as:

`Excel sheet + Python superpowers`
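
To make the 1D versus 2D distinction concrete, here is a minimal sketch. The variable names `ages` and `people` are illustrative and not part of the notebook.

```python
import pandas as pd

# A Series is a single labelled column of values.
ages = pd.Series([23, 25, 22], name="age")

# A DataFrame is a table; each of its columns is itself a Series.
people = pd.DataFrame({"name": ["Ayaan", "Riya", "Kabir"], "age": [23, 25, 22]})

print(ages.ndim)                              # 1
print(people.ndim)                            # 2
print(isinstance(people["age"], pd.Series))   # True
```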

## Task 1 : Create a Dataset

In [17]:
df = {'name':['Ayaan', 'Riya', 'Kabir'], 
      'age': [23, 25,22],
      'city': ['Mumbai', 'Delhi', 'Bengaluru']}

In [18]:
df = pd.DataFrame(df)

In [19]:
df

Unnamed: 0,name,age,city
0,Ayaan,23,Mumbai
1,Riya,25,Delhi
2,Kabir,22,Bengaluru


## Task 2 : Check Dataset attributes

In [26]:
df.head()

Unnamed: 0,name,age,city
0,Ayaan,23,Mumbai
1,Riya,25,Delhi
2,Kabir,22,Bengaluru


#### `df.head(n)` returns the first n rows (default: 5)

In [27]:
df.tail()

Unnamed: 0,name,age,city
0,Ayaan,23,Mumbai
1,Riya,25,Delhi
2,Kabir,22,Bengaluru


#### `df.tail(n)` returns the last n rows (default: 5)
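
With only three rows, `head()` and `tail()` both return the whole table, so the effect of `n` is hard to see. The sketch below uses a hypothetical 10-row frame called `numbers` to show the difference.

```python
import pandas as pd

# Hypothetical 10-row frame, index 0..9
numbers = pd.DataFrame({"value": range(10)})

first_two = numbers.head(2)     # rows 0 and 1
last_three = numbers.tail(3)    # rows 7, 8 and 9
default_head = numbers.head()   # first 5 rows when n is omitted

print(first_two)
print(last_three)
print(len(default_head))        # 5
```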

In [29]:
df.shape

(3, 3)

#### `df.shape` gives the shape of the DataFrame as (rows, columns)

In [31]:
df.columns

Index(['name', 'age', 'city'], dtype='object')

#### `df.columns` gives the column names

In [32]:
df.info

<bound method DataFrame.info of     name  age       city
0  Ayaan   23     Mumbai
1   Riya   25      Delhi
2  Kabir   22  Bengaluru>

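#### `df.info()` prints a summary of the DataFrame (columns, non-null counts, dtypes). Written without parentheses, as above, `df.info` only returns the bound method and does not run it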
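
A small sketch of the difference between referencing the method and calling it. The frame `df_demo` is illustrative, and the `buf` argument is used here only so the summary can be captured as a string.

```python
import io
import pandas as pd

df_demo = pd.DataFrame({"name": ["Ayaan", "Riya", "Kabir"], "age": [23, 25, 22]})

method = df_demo.info        # no parentheses: just a reference to the method
print(callable(method))      # True

buffer = io.StringIO()
df_demo.info(buf=buffer)     # parentheses: the method runs and writes its summary
summary = buffer.getvalue()
print(summary)
```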

## Task 3 : Deal with missing Values

In [62]:
import pandas as pd

data = {
    'experience_years': [1, 3, 5, None, 10],
    'salary_lpa': [3, 6, 9, 11, None],
    'department': ['IT', 'HR', 'IT', 'Finance', 'HR']
}

df = pd.DataFrame(data)
df


Unnamed: 0,experience_years,salary_lpa,department
0,1.0,3.0,IT
1,3.0,6.0,HR
2,5.0,9.0,IT
3,,11.0,Finance
4,10.0,,HR


In [43]:
type(df)

pandas.core.frame.DataFrame

In [45]:
list(df.columns)

['experience_years', 'salary_lpa', 'department']

In [47]:
df.isnull().sum()

experience_years    1
salary_lpa          1
department          0
dtype: int64

In [49]:
df.columns[df.isnull().any()]

Index(['experience_years', 'salary_lpa'], dtype='object')

In [69]:
# Assign the result back instead of using inplace=True on a selected column
df['experience_years'] = df['experience_years'].fillna(df['experience_years'].median())
df['salary_lpa'] = df['salary_lpa'].fillna(df['salary_lpa'].median())
df

Unnamed: 0,experience_years,salary_lpa,department
0,1.0,3.0,IT
1,3.0,6.0,HR
2,5.0,9.0,IT
3,4.0,11.0,Finance
4,10.0,7.5,HR


#### 1. Real-world data is rarely symmetric

Especially:

    salaries

    experience

    house prices

    incomes

    These usually have outliers.

        Example salary data:

        3, 5, 6, 7, 8, 50


        Mean = (3+5+6+7+8+50)/6 ≈ 13.17 ❌ (not realistic)

        Median = 6.5 ✅ (represents typical value)

        📌 Mean gets pulled by extreme values
        📌 Median stays stable
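
The salary example can be checked directly. This is a minimal sketch using the same six values; `without_outlier` is an illustrative name.

```python
import pandas as pd

salaries = pd.Series([3, 5, 6, 7, 8, 50])
print(salaries.mean())     # about 13.17
print(salaries.median())   # 6.5

# Remove the single extreme value and compare again.
without_outlier = salaries[salaries < 50]
print(without_outlier.mean())     # 5.8
print(without_outlier.median())   # 6.0
```

Dropping one value moves the mean from about 13.17 to 5.8, while the median only moves from 6.5 to 6.0.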
    
#### 2. ML models learn patterns, not averages

    If you fill gaps with the mean of skewed data:

    You inject artificially high or low values

    Models may learn false relationships

    Median preserves the true data distribution better
    

#### 3. Which value to fill with

| Situation              | Use      |
| ---------------------- | -------- |
| Skewed data / outliers | ✅ Median |
| Clean, symmetric data  | Mean     |
| Categorical features   | Mode     |
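
As a sketch of the table in practice, the snippet below fills a numeric column with its median and a categorical column with its mode. The `staff` frame is hypothetical data invented for illustration.

```python
import pandas as pd

# Hypothetical data: one numeric and one categorical column, each with a gap
staff = pd.DataFrame({
    "salary_lpa": [3.0, 6.0, 9.0, 11.0, None],
    "department": ["IT", "HR", "IT", None, "IT"],
})

# Numeric column: median of 3, 6, 9, 11 is 7.5
staff["salary_lpa"] = staff["salary_lpa"].fillna(staff["salary_lpa"].median())

# Categorical column: mode() returns a Series, so take its first entry ("IT")
staff["department"] = staff["department"].fillna(staff["department"].mode()[0])

print(staff)
```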


In [71]:
df.isnull().sum()

experience_years    0
salary_lpa          0
department          0
dtype: int64

## Extras

### Task 1 : Load a Dataset and get insights

In [97]:
df = pd.read_csv('car-sales-missing-data.csv')
df.head()

Unnamed: 0,Make,Colour,Odometer,Doors,Price
0,Toyota,White,150043.0,4.0,"$4,000"
1,Honda,Red,87899.0,4.0,"$5,000"
2,Toyota,Blue,,3.0,"$7,000"
3,BMW,Black,11179.0,5.0,"$22,000"
4,Nissan,White,213095.0,4.0,"$3,500"


In [98]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Make      9 non-null      object 
 1   Colour    9 non-null      object 
 2   Odometer  6 non-null      float64
 3   Doors     9 non-null      float64
 4   Price     8 non-null      object 
dtypes: float64(2), object(3)
memory usage: 528.0+ bytes


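#### Which columns are categorical?
-> Make, Colour (Price is also stored as `object`, but only because of the `$` and `,` symbols)

#### Which columns are numerical?
-> Odometer, Doors (and Price, once the symbols are removed)

#### Which columns would need encoding for ML?
-> Make, Colour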
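
Rather than reading the dtypes off `df.info()` by eye, the split can be computed with `select_dtypes`. This sketch uses a small hypothetical `cars` frame with a subset of the columns.

```python
import pandas as pd

# Hypothetical two-row sample with the same kinds of columns
cars = pd.DataFrame({
    "Make": ["Toyota", "Honda"],
    "Odometer": [150043.0, 87899.0],
    "Price": ["$4,000", "$5,000"],
})

numeric_cols = cars.select_dtypes(include="number").columns.tolist()
text_cols = cars.select_dtypes(exclude="number").columns.tolist()

print(numeric_cols)   # ['Odometer']
print(text_cols)      # ['Make', 'Price']
```

Price lands among the text columns until its symbols are stripped and it is converted to a numeric dtype.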

### Task 2 : Handling Missing values

In [99]:
list(df.columns[df.isnull().any()])

['Make', 'Colour', 'Odometer', 'Doors', 'Price']

#### What does any() do?
-> any() checks if at least ONE value is True

In [145]:
df.isnull().any().any()

True

#### Meaning:

    1. first any() → per column
    2. second any() → across all columns

In [146]:
df.notnull().all()

Make        False
Colour      False
Odometer     True
Doors       False
Price       False
dtype: bool

Odometer shows `True` here because this cell (In [146]) was run after the Odometer fill in In [100], shown further down.

| Function | Meaning            |
| -------- | ------------------ |
| `any()`  | at least one True  |
| `all()`  | everything is True |
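
A compact sketch of `any()` and `all()` on a null mask, using a hypothetical two-column frame named `frame`.

```python
import pandas as pd

# Hypothetical frame: column "a" has one missing value, column "b" has none
frame = pd.DataFrame({"a": [1.0, None, 3.0], "b": [1.0, 2.0, 3.0]})

missing = frame.isnull()

per_column = missing.any()                # a: True, b: False
anything_missing = missing.any().any()    # True
complete_columns = frame.notnull().all()  # a: False, b: True

print(per_column)
print(anything_missing)
print(complete_columns)
```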


In [100]:
df['Odometer'] = df['Odometer'].fillna(df['Odometer'].median())
df['Odometer'].isnull().sum()

0

### Task 3 : Remove symbols

In [105]:
# Strip the currency symbol and the thousands separator as literal characters
df['Price'] = df['Price'].str.replace("$", '', regex=False)
df['Price'] = df['Price'].str.replace(",", '', regex=False)

#### What is regex (regular expression)?
-> Regex is a pattern used to find, match, or replace text.

#### Why regex exists?
-> Instead of matching exact text, regex matches patterns.

| Pattern | Meaning              |
| ------- | -------------------- |
| `$`     | end of string        |
| `.`     | any single character |
| `\d`    | any digit (0–9)      |
| `+`     | one or more          |
| `*`     | zero or more         |

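`df['Price'].str.replace('$', '', regex=True)`

When the pattern is interpreted as a regex, `$` means "end of string", NOT the dollar symbol.

`df['Price'].str.replace('$', '', regex=False)`

Here `$` is treated as a normal character, not a pattern. The default value of `regex` has changed between pandas versions, so passing it explicitly avoids surprises.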
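
The two meanings of `$` can be seen with Python's built-in `re` module, which always treats the pattern as a regex. The sketch then shows two ways to clean a price column in pandas; the `prices` Series and the `<END>` marker are illustrative.

```python
import re
import pandas as pd

price = "$4,000"

# Unescaped: "$" matches the empty position at the end of the string.
anchored = re.sub("$", "<END>", price)
print(anchored)   # $4,000<END>

# Escaped: "\$" matches the literal dollar character.
escaped = re.sub(r"\$", "", price)
print(escaped)    # 4,000

prices = pd.Series(["$4,000", "$22,000", None])

# Option 1: two literal replacements (regex=False)
literal = prices.str.replace("$", "", regex=False).str.replace(",", "", regex=False)

# Option 2: one regex replacement; inside [...] the "$" is a literal character
char_class = prices.str.replace(r"[$,]", "", regex=True)

print(literal.tolist()[:2])   # ['4000', '22000']
```

Both options leave the missing value untouched, so the later `astype(float)` still works.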

In [106]:
df['Price'].dtype

dtype('O')

In [110]:
df['Price'] = df['Price'].astype(float)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Make      9 non-null      object 
 1   Colour    9 non-null      object 
 2   Odometer  10 non-null     float64
 3   Doors     9 non-null      float64
 4   Price     8 non-null      float64
dtypes: float64(3), object(2)
memory usage: 528.0+ bytes


### Task 4 : Groupby()

In [114]:
# Average Price per Make
df.groupby("Make")["Price"].mean()

Make
BMW       22000.000000
Honda      6250.000000
Nissan     3500.000000
Toyota     5166.666667
Name: Price, dtype: float64

In [115]:
# Average Odometer per Colour
df.groupby("Colour")['Odometer'].mean()

Colour
Black     11179.0
Blue      73949.5
Green     73949.5
Red       87899.0
White    113684.5
Name: Odometer, dtype: float64

### Whenever you need to find the mean of X per Y, the groupby will be --> `df.groupby(Y)[X].mean()` (or another aggregation in place of `mean()`)
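
The "X per Y" pattern can be sketched on a small hypothetical `cars` frame. Swapping `mean()` for `agg([...])` computes several statistics per group at once.

```python
import pandas as pd

# Hypothetical sample: two makes, two cars each
cars = pd.DataFrame({
    "Make": ["Toyota", "Honda", "Toyota", "Honda"],
    "Price": [4000.0, 5000.0, 7000.0, 7500.0],
})

# "Mean Price per Make": Y = Make, X = Price
mean_price = cars.groupby("Make")["Price"].mean()
print(mean_price)   # Honda 6250.0, Toyota 5500.0

# Same grouping, several aggregations in one call
summary = cars.groupby("Make")["Price"].agg(["mean", "max", "count"])
print(summary)
```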

#### GroupBy helps you:
    detect feature importance hints
    spot biases in data
    decide feature transformations
    explain model behavior

### Task 5 : Boolean Filtering

In [124]:
median_odo = df['Odometer'].median()
median_price = df["Price"].median()
median_odo, median_price

(73949.5, 6000.0)

In [128]:
df[(df['Odometer'] < median_odo) & (df['Price'] > median_price)]

Unnamed: 0,Make,Colour,Odometer,Doors,Price
3,BMW,Black,11179.0,5.0,22000.0
9,,White,31600.0,4.0,9700.0


### Task 6 : Feature Selection for ML

In [137]:
X = df.drop("Price", axis = 1)
y = df["Price"]

In [138]:
X, y

(     Make Colour  Odometer  Doors
 0  Toyota  White  150043.0    4.0
 1   Honda    Red   87899.0    4.0
 2  Toyota   Blue   73949.5    3.0
 3     BMW  Black   11179.0    5.0
 4  Nissan  White  213095.0    4.0
 5  Toyota  Green   73949.5    4.0
 6   Honda    NaN   73949.5    4.0
 7   Honda   Blue   73949.5    4.0
 8  Toyota  White   60000.0    NaN
 9     NaN  White   31600.0    4.0,
 0     4000.0
 1     5000.0
 2     7000.0
 3    22000.0
 4     3500.0
 5     4500.0
 6     7500.0
 7        NaN
 8        NaN
 9     9700.0
 Name: Price, dtype: float64)

#### Why is X not ready for an ML model yet?
-> X contains categorical (text) columns, Make and Colour, and both X and y still contain missing values

#### Which columns must be transformed?
-> Make, Colour

#### What transformation is required (high-level, no code)?
-> Categorical encoding, for example one-hot encoding
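
For reference, one way to one-hot encode in pandas is `pd.get_dummies`. This is only a sketch on a hypothetical `X_demo` frame; the notebook itself does not pick an encoder.

```python
import pandas as pd

# Hypothetical feature frame with one categorical and one numeric column
X_demo = pd.DataFrame({
    "Make": ["Toyota", "Honda", "BMW"],
    "Odometer": [150043.0, 87899.0, 11179.0],
})

# Each distinct Make becomes its own indicator column
encoded = pd.get_dummies(X_demo, columns=["Make"])
print(encoded.columns.tolist())
# ['Odometer', 'Make_BMW', 'Make_Honda', 'Make_Toyota']
```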

`() → do something`

`[] → get something`

`{} → store something`

In [140]:
df.isnull, df.isnull()

(<bound method DataFrame.isnull of      Make Colour  Odometer  Doors    Price
 0  Toyota  White  150043.0    4.0   4000.0
 1   Honda    Red   87899.0    4.0   5000.0
 2  Toyota   Blue   73949.5    3.0   7000.0
 3     BMW  Black   11179.0    5.0  22000.0
 4  Nissan  White  213095.0    4.0   3500.0
 5  Toyota  Green   73949.5    4.0   4500.0
 6   Honda    NaN   73949.5    4.0   7500.0
 7   Honda   Blue   73949.5    4.0      NaN
 8  Toyota  White   60000.0    NaN      NaN
 9     NaN  White   31600.0    4.0   9700.0>,
     Make  Colour  Odometer  Doors  Price
 0  False   False     False  False  False
 1  False   False     False  False  False
 2  False   False     False  False  False
 3  False   False     False  False  False
 4  False   False     False  False  False
 5  False   False     False  False  False
 6  False    True     False  False  False
 7  False   False     False  False   True
 8  False   False     False   True   True
 9   True   False     False  False  False)

### `bound method means: "This function (isnull) is attached to this specific DataFrame (df), but it has NOT been run yet."`

### One-line memory rule

`Use &, |, ~ with parentheses for Pandas filtering`

`Use and, or, not only for single booleans`
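
The rule can be tried on a small hypothetical `cars` frame. The parentheses matter because `&` and `|` bind more tightly than comparisons such as `<`, and `and` fails because a whole Series has no single truth value.

```python
import pandas as pd

# Hypothetical three-row sample
cars = pd.DataFrame({
    "Odometer": [150043.0, 11179.0, 31600.0],
    "Price": [4000.0, 22000.0, 9700.0],
})

both = cars[(cars["Odometer"] < 50000) & (cars["Price"] > 6000)]     # rows 1 and 2
either = cars[(cars["Odometer"] < 50000) | (cars["Price"] > 20000)]  # rows 1 and 2
high_km = cars[~(cars["Odometer"] < 50000)]                          # row 0

# Python's `and` asks for the truth value of an entire Series, which is ambiguous.
try:
    cars[(cars["Odometer"] < 50000) and (cars["Price"] > 6000)]
    raised = False
except ValueError:
    raised = True

print(raised)   # True
```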