## pandas
- Used for exploratory data analysis (EDA)
- Used for data cleaning and preparation
- Handles importing/exporting data and creating/deleting columns

#### Topics:
- DataFrame: set_option, map, apply, applymap, lambda
- DataFrame: sampling, sorting
- DataFrame: statistical operations such as mean, median, sum

In [4]:
import pandas as pd
import numpy as np

print(pd.__version__)
data_file = 'data.csv'

2.2.3


## Dataframe

In [35]:
# Goal: print every row/column instead of the truncated default view.
# Step 1: with default options, printing a large DataFrame shows only the first and last rows.
df = pd.read_csv('breast_cancer_modified.csv')

# Step 2: lift the row limit (None means "no limit").
pd.set_option('display.max_rows', None)  # show all rows
# pd.set_option('display.max_columns', None)  # show all columns

print(df)

# Step 3: restore the default once done.
pd.reset_option('display.max_rows')

     mean radius  mean texture  mean perimeter  mean area  mean smoothness  \
0         17.990         10.38          122.80     1001.0          0.11840   
1         20.570         17.77          132.90     1326.0          0.08474   
2         19.690         21.25          130.00     1203.0          0.10960   
3         11.420         20.38           77.58      386.1          0.14250   
4         20.290         14.34          135.10     1297.0          0.10030   
5         12.450         15.70           82.57      477.1          0.12780   
6         18.250         19.98          119.60     1040.0          0.09463   
7         13.710         20.83           90.20      577.9          0.11890   
8         13.000         21.82           87.50      519.8          0.12730   
9         12.460         24.04           83.97      475.9          0.11860   
10        13.780         15.79           88.37      585.9          0.08817   
11        10.570         18.32           66.82      340.9       
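
The set/print/reset steps above can also be written with `pd.option_context`, which restores the previous setting automatically when the block ends. A minimal sketch, assuming a small made-up DataFrame in place of breast_cancer_modified.csv:

```python
import pandas as pd

# Hypothetical stand-in data; the notebook reads breast_cancer_modified.csv instead.
df = pd.DataFrame({"value": range(100)})

default_max_rows = pd.get_option("display.max_rows")

# Inside the block the row limit is lifted; on exit the previous value comes back.
with pd.option_context("display.max_rows", None):
    inside_max_rows = pd.get_option("display.max_rows")
    full_text = repr(df)  # every row is rendered

restored_max_rows = pd.get_option("display.max_rows")
truncated_text = repr(df)  # default view: first and last rows only

print(len(full_text.splitlines()), len(truncated_text.splitlines()))
```

This avoids forgetting the reset step if the cell fails halfway through.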

## apply, lambda, map (IGNORE), applymap (IGNORE)
**apply and lambda** are the important ones; together they can do everything that map and applymap do.

#### apply
- Used with: **Series or DataFrame**
- Purpose: Apply a function to rows or columns (DataFrame) or to each element (Series)
- More general than map: on a DataFrame the function receives a whole row or column, not a single value

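#### (IGNORE) map
- Used with: **Series (1D)**; since pandas 2.1, DataFrame.map also exists as the element-wise replacement for applymap
- Purpose: Apply a function element-wise

#### (IGNORE) applymap
- Used with: **DataFrame only (2D)**
- Purpose: Apply a function to every single element
- Like: Nested .apply() for all cells
- Deprecated since pandas 2.1 in favor of DataFrame.map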
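
To make the difference concrete, here is a small sketch of what the function receives in each case. The two-row DataFrame is made up for illustration and is not the notebook's data.csv:

```python
import pandas as pd

# Hypothetical toy data
df = pd.DataFrame({"Age": [25, 35], "Experience": [2, 7]})

# Series.apply: the function is called once per element
doubled = df["Age"].apply(lambda x: x * 2)

# Series.map: also element-wise, and additionally accepts a dict lookup
labels = df["Age"].map({25: "Young", 35: "Senior"})

# DataFrame.apply, axis=0 (default): the function receives each column as a Series
col_max = df.apply(lambda col: col.max())

# DataFrame.apply, axis=1: the function receives each row as a Series
row_sum = df.apply(lambda row: row["Age"] + row["Experience"], axis=1)

print(doubled.tolist(), labels.tolist(), col_max.to_dict(), row_sum.tolist())
```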

In [5]:
pd.reset_option('display.max_rows')

df = pd.read_csv('data.csv')
print(df)

# Using apply() on a Series: the lambda is called once per element of the column
df['Tax'] = df['Salary'].apply(lambda x: x * 0.1)  # a NaN Salary gives a NaN Tax
print("\nAfter using apply() to calculate tax from Salary:\n", df)

# Using apply() across rows (axis=1): the lambda receives one row at a time
# df['Info'] = df.apply(lambda row: "XXX", axis=1)  # step 1: constant placeholder
df['Info'] = df.apply(lambda row: f"{row['Name']} is {row['Age']} years old", axis=1)  # step 2: build text from two columns
print("\nAfter using apply() across rows to generate Info:\n", df)


df['Age_Group'] = df['Age'].apply(lambda x: 'Young' if x < 30 else 'Senior')
# df['Age_Group'] = df['Age'].map(lambda x: 'Young' if x < 30 else 'Senior')
print("\nAfter:\n", df)

       Name  Age         City  Experience  Experience2   Salary passport
0     Alice   25      Chicago           2            2  70000.0      a43
1       Bob   30  Los Angeles           5            5  80000.0      a44
2   Charlie   35      Chicago           7            7      NaN      a45
3     David   40      Houston          10           10  90000.0      a46
4       Eva   22      Houston           1            1  48000.0      a47
5     Frank   28          NaN           3            3  72000.0      a48
6     Grace   32  San Antonio           6            6  85000.0      a49
7     Helen   26    San Diego           2            2  62000.0      a50
8     Helen   26    San Diego           2            2  62000.0      a51
9     Helen   26    San Diego           2            2  62000.0      a52
10    Jerry   23      Phoenix           6            6  78000.0      a53

After using apply() to calculate tax from Salary:
        Name  Age         City  Experience  Experience2   Salary passport
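
For simple arithmetic and single-condition labels like the ones in this cell, the same columns can usually be built without apply() by operating on the whole column at once. A sketch under the assumption of a three-row stand-in for data.csv:

```python
import numpy as np
import pandas as pd

# Hypothetical subset of the columns in data.csv
df = pd.DataFrame({"Age": [25, 30, 35], "Salary": [70000.0, 80000.0, np.nan]})

# Same result as df['Salary'].apply(lambda x: x * 0.1), computed on the whole column
df["Tax"] = df["Salary"] * 0.1

# Same result as the 'Young'/'Senior' lambda, using a vectorized condition
df["Age_Group"] = np.where(df["Age"] < 30, "Young", "Senior")

print(df)
```

apply() remains the more flexible tool when the logic does not reduce to column arithmetic.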

In [9]:
# Custom functions (CF) passed to apply()

## 1) Row-wise CF: one condition
def categorize_person(row):
    if row['Age'] < 30:
        return 'Young'
    else:
        return 'Senior'

df['Age_Group'] = df.apply(categorize_person, axis=1)
print(df)

## 2) Row-wise CF: conditions on two columns
def categorize_status(row):
    if row['Age'] < 30 and row['Experience'] < 5:
        return 'Junior'
    elif row['Experience'] >= 7:
        return 'Senior'
    else:
        return 'Mid-Level'

df['Status'] = df.apply(categorize_status, axis=1)
print(df)

## 3) Column-wise CF: the function receives a whole column (Series)
# Replace missing values in 'Salary' with the average salary
def fill_salary(col):
    return col.fillna(col.mean())

df['Salary1'] = df[['Salary']].apply(fill_salary)  # or directly: fill_salary(df['Salary'])
print(df)

## 4) Element-wise CF: add "USD " to each Salary value
def add_usd(salary):
    if pd.notnull(salary):
        return f"USD {salary:,.2f}"
    return salary  # keep NaN as is

df['Salary2'] = df['Salary'].apply(add_usd)
print(df)



       Name  Age         City  Experience  Experience2   Salary passport  \
0     Alice   25      Chicago           2            2  70000.0      a43   
1       Bob   30  Los Angeles           5            5  80000.0      a44   
2   Charlie   35      Chicago           7            7  70900.0      a45   
3     David   40      Houston          10           10  90000.0      a46   
4       Eva   22      Houston           1            1  48000.0      a47   
5     Frank   28          NaN           3            3  72000.0      a48   
6     Grace   32  San Antonio           6            6  85000.0      a49   
7     Helen   26    San Diego           2            2  62000.0      a50   
8     Helen   26    San Diego           2            2  62000.0      a51   
9     Helen   26    San Diego           2            2  62000.0      a52   
10    Jerry   23      Phoenix           6            6  78000.0      a53   

       Tax                     Info     Status  Salary1        Salary2  
0   7000.0    
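
The Junior / Senior / Mid-Level rule can also be expressed with `np.select`, which evaluates a list of conditions in order, much like if/elif/else. This is an alternative sketch, not the notebook's method, and the four rows are made up:

```python
import numpy as np
import pandas as pd

# Hypothetical rows shaped like data.csv
df = pd.DataFrame({"Age": [25, 30, 40, 23], "Experience": [2, 5, 10, 6]})

# Conditions are checked in order; the first match wins, like if/elif
conditions = [
    (df["Age"] < 30) & (df["Experience"] < 5),
    df["Experience"] >= 7,
]
choices = ["Junior", "Senior"]

# default plays the role of the final else branch
df["Status"] = np.select(conditions, choices, default="Mid-Level")
print(df)
```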

## sampling

In [37]:
# Syntax: df.sample(n=None, frac=None, replace=False, weights=None, random_state=None, axis=None)

print(df.sample(n=3))  # sample 3 random rows
print(df.sample(frac=0.2))  # sample 20% of the rows
print(df.sample(n=5, replace=True))  # sample with replacement, so duplicates can appear
print(df.sample(n=3, random_state=42))  # fixed seed: the same 3 rows on every run

    Name  Age       City  Experience  Experience2   Salary Age_Group     Tax  \
3  David   40    Houston          10           10  90000.0    Senior  9000.0   
9  Helen   26  San Diego           2            2  62000.0     Young  6200.0   
8  Helen   26  San Diego           2            2  62000.0     Young  6200.0   

                    Info  
3  David is 40 years old  
9  Helen is 26 years old  
8  Helen is 26 years old  
       Name  Age     City  Experience  Experience2   Salary Age_Group     Tax  \
2   Charlie   35  Chicago           7            7      NaN    Senior     NaN   
10    Jerry   23  Phoenix           6            6  78000.0     Young  7800.0   

                       Info  
2   Charlie is 35 years old  
10    Jerry is 23 years old  
     Name  Age         City  Experience  Experience2   Salary Age_Group  \
8   Helen   26    San Diego           2            2  62000.0     Young   
3   David   40      Houston          10           10  90000.0    Senior   
10  Jerry   
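
Two common uses of sample() are reproducible draws and shuffling. A short sketch, assuming a made-up ten-row DataFrame:

```python
import pandas as pd

# Hypothetical data
df = pd.DataFrame({"Name": list("ABCDEFGHIJ"), "Score": range(10)})

# The same random_state gives the same rows on every call
first = df.sample(n=3, random_state=42)
second = df.sample(n=3, random_state=42)

# frac=1 returns every row in random order, i.e. a shuffle
shuffled = df.sample(frac=1, random_state=0)

# reset_index(drop=True) discards the old row labels after shuffling
shuffled_clean = shuffled.reset_index(drop=True)

print(first.index.tolist(), shuffled.index.tolist())
```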

## sort_values()

In [38]:

# 1. Sort by a single column (Age) in ascending order
sorted_by_age = df.sort_values(by='Age')
print("\nSorted by Age (ascending):\n", sorted_by_age)

# 2. Sort by a single column (Salary) in descending order
sorted_by_salary = df.sort_values(by='Salary', ascending=False)
print("\nSorted by Salary (descending):\n", sorted_by_salary)

# 3. Sort by multiple columns: first by Salary descending, then by Age ascending
sorted_multi = df.sort_values(by=['Salary', 'Age'], ascending=[False, True])
print("\nSorted by Salary (desc) and Age (asc):\n", sorted_multi)


Sorted by Age (ascending):
        Name  Age         City  Experience  Experience2   Salary Age_Group  \
4       Eva   22      Houston           1            1  48000.0     Young   
10    Jerry   23      Phoenix           6            6  78000.0     Young   
0     Alice   25      Chicago           2            2  70000.0     Young   
7     Helen   26    San Diego           2            2  62000.0     Young   
8     Helen   26    San Diego           2            2  62000.0     Young   
9     Helen   26    San Diego           2            2  62000.0     Young   
5     Frank   28          NaN           3            3  72000.0     Young   
1       Bob   30  Los Angeles           5            5  80000.0    Senior   
6     Grace   32  San Antonio           6            6  85000.0    Senior   
2   Charlie   35      Chicago           7            7      NaN    Senior   
3     David   40      Houston          10           10  90000.0    Senior   

       Tax                     Info  
4   4800
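
Because Salary contains a missing value, it helps to know where sort_values() puts NaN. The sketch below uses three made-up rows to show the default placement, the `na_position` parameter, and `ignore_index`:

```python
import numpy as np
import pandas as pd

# Hypothetical rows; one Salary is missing
df = pd.DataFrame({
    "Name": ["Alice", "Charlie", "David"],
    "Salary": [70000.0, np.nan, 90000.0],
})

# Default: NaN goes to the end, for ascending and descending sorts alike
desc = df.sort_values(by="Salary", ascending=False)

# na_position='first' moves the missing values to the top instead
nan_first = df.sort_values(by="Salary", na_position="first")

# ignore_index=True renumbers the result 0..n-1 instead of keeping the original labels
renumbered = df.sort_values(by="Salary", ascending=False, ignore_index=True)

print(desc["Name"].tolist(), nan_first["Name"].tolist(), renumbered.index.tolist())
```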

## statistical operations
- mean, median, sum, etc

In [26]:
# important ones
df = pd.read_csv('data.csv')
print(df)

# Or use below for another dataset
# import seaborn as sns
# df = sns.load_dataset("tips")

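print(df.shape)  # (rows, columns)
print(df.describe())  # summary statistics for the numeric columns
df.info()  # prints dtypes and non-null counts itself; wrapping it in print() adds a stray "None"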

       Name  Age         City  Experience  Experience2   Salary passport
0     Alice   25      Chicago           2            2  70000.0      a43
1       Bob   30  Los Angeles           5            5  80000.0      a44
2   Charlie   35      Chicago           7            7      NaN      a45
3     David   40      Houston          10           10  90000.0      a46
4       Eva   22      Houston           1            1  48000.0      a47
5     Frank   28          NaN           3            3  72000.0      a48
6     Grace   32  San Antonio           6            6  85000.0      a49
7     Helen   26    San Diego           2            2  62000.0      a50
8     Helen   26    San Diego           2            2  62000.0      a51
9     Helen   26    San Diego           2            2  62000.0      a52
10    Jerry   23      Phoenix           6            6  78000.0      a53
(11, 7)
             Age  Experience  Experience2        Salary
count  11.000000   11.000000    11.000000     10.000000
mean
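
The Salary count of 10 against 11 rows reflects that describe() skips missing values. A small sketch, with made-up data, of counting the missing values directly and of including non-numeric columns in the summary:

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value per column
df = pd.DataFrame({
    "City": ["Chicago", None, "Houston"],
    "Salary": [70000.0, np.nan, 90000.0],
})

missing_per_column = df.isna().sum()       # number of missing values in each column
numeric_summary = df.describe()            # numeric columns only; count skips NaN
full_summary = df.describe(include="all")  # adds unique/top/freq rows for text columns

print(missing_per_column)
print(full_summary)
```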

In [13]:
df = pd.read_csv('data.csv')
print(df)

numeric_cols = df.select_dtypes(include='number').columns
df = df[numeric_cols]

# Or use below for another dataset
# import seaborn as sns
# df = sns.load_dataset("tips")
# numeric_cols = df.select_dtypes(include='number').columns
# df = df[numeric_cols]


print("\n--- Column-wise Statistics ---")
print("Mean:\n", df.mean())
print("Median:\n", df.median())
print("Mode:\n", df.mode().iloc[0])
print("Standard Deviation:\n", df.std())
print("Min:\n", df.min())
print("Max:\n", df.max())
print("Sum:\n", df.sum())

# Or compute a statistic for a single column
print(df['Age'].mean())
print(df['Age'].sum())

# Or compute over all numeric columns without filtering first.
# On the unfiltered DataFrame, df.mean() raises a TypeError because of the string columns;
# numeric_only=True skips them.
print(df.mean(numeric_only=True))
print(df.sum(numeric_only=True))



       Name  Age         City  Experience  Experience2   Salary passport
0     Alice   25      Chicago           2            2  70000.0      a43
1       Bob   30  Los Angeles           5            5  80000.0      a44
2   Charlie   35      Chicago           7            7      NaN      a45
3     David   40      Houston          10           10  90000.0      a46
4       Eva   22      Houston           1            1  48000.0      a47
5     Frank   28          NaN           3            3  72000.0      a48
6     Grace   32  San Antonio           6            6  85000.0      a49
7     Helen   26    San Diego           2            2  62000.0      a50
8     Helen   26    San Diego           2            2  62000.0      a51
9     Helen   26    San Diego           2            2  62000.0      a52
10    Jerry   23      Phoenix           6            6  78000.0      a53

--- Column-wise Statistics ---
Mean:
 Age               28.454545
Experience         4.181818
Experience2        4.181818
S
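
Several of the statistics above can be collected in one call with agg(). A minimal sketch, assuming a made-up numeric DataFrame:

```python
import pandas as pd

# Hypothetical numeric data
df = pd.DataFrame({"Age": [25, 30, 35], "Experience": [2, 5, 8]})

# A list of names gives one row per statistic and one column per DataFrame column
summary = df.agg(["mean", "median", "min", "max"])

# A dict picks a different statistic for each column
per_column = df.agg({"Age": "mean", "Experience": "sum"})

print(summary)
print(per_column)
```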

In [45]:
# (SKIP) Sample DataFrame
data = {
    'a': [10, 20, 30, 40],
    'b': [5, 15, 25, 35],
    'c': [2, 4, 6, 8],
    'd': [100, 200, 300, 400]
}
df = pd.DataFrame(data)

# Or use below for another dataset
# import seaborn as sns
# df = sns.load_dataset("tips")
# numeric_cols = df.select_dtypes(include='number').columns
# df = df[numeric_cols]

print("Original DataFrame:\n", df)

print("\n--- Column-wise Statistics ---")
print("Mean:\n", df.mean())
print("Median:\n", df.median())
print("Mode:\n", df.mode().iloc[0])  # first mode per column; if every value is unique, that is the smallest value
print("Standard Deviation:\n", df.std())
print("Min:\n", df.min())
print("Max:\n", df.max())
print("Sum:\n", df.sum())

print("\n--- Row-wise Statistics ---")
print("Mean:\n", df.mean(axis=1))
print("Median:\n", df.median(axis=1))
print("Mode:\n", df.mode(axis=1))  # Row-wise mode can return multiple values
print("Standard Deviation:\n", df.std(axis=1))
print("Min:\n", df.min(axis=1))
print("Max:\n", df.max(axis=1))
print("Sum:\n", df.sum(axis=1))


Original DataFrame:
     a   b  c    d
0  10   5  2  100
1  20  15  4  200
2  30  25  6  300
3  40  35  8  400

--- Column-wise Statistics ---
Mean:
 a     25.0
b     20.0
c      5.0
d    250.0
dtype: float64
Median:
 a     25.0
b     20.0
c      5.0
d    250.0
dtype: float64
Mode:
 a     10
b      5
c      2
d    100
Name: 0, dtype: int64
Standard Deviation:
 a     12.909944
b     12.909944
c      2.581989
d    129.099445
dtype: float64
Min:
 a     10
b      5
c      2
d    100
dtype: int64
Max:
 a     40
b     35
c      8
d    400
dtype: int64
Sum:
 a     100
b      80
c      20
d    1000
dtype: int64

--- Row-wise Statistics ---
Mean:
 0     29.25
1     59.75
2     90.25
3    120.75
dtype: float64
Median:
 0     7.5
1    17.5
2    27.5
3    37.5
dtype: float64
Mode:
    0   1   2    3
0  2   5  10  100
1  4  15  20  200
2  6  25  30  300
3  8  35  40  400
Standard Deviation:
 0     47.281956
1     93.738555
2    140.215013
3    186.696501
dtype: float64
Min:
 0    2
1    4
2    6
3 
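
Two details in the output above are worth spelling out: `axis` decides whether a statistic is computed per column or per row, and std() in pandas divides by n - 1 by default (`ddof=1`), which is why column a gives 12.909944. A sketch reusing columns a and b from the sample DataFrame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [10, 20, 30, 40], "b": [5, 15, 25, 35]})

col_mean = df.mean()        # axis=0 (default): one value per column
row_mean = df.mean(axis=1)  # axis=1: one value per row

sample_std = df["a"].std()              # pandas default ddof=1 (divides by n - 1)
population_std = df["a"].std(ddof=0)    # divides by n
numpy_std = np.std(df["a"].to_numpy())  # NumPy's default is ddof=0

print(sample_std, population_std, numpy_std)
```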

In [24]:
# Goal: list the unique values of a column and count how often each occurs

# Example 1
df = pd.read_csv('data.csv')
print(df)
print("0000000000000")
print(df['City'].unique())  # array of unique values (NaN included)
print("111111111111111")
print(df['City'].value_counts())  # count per unique value (NaN excluded by default)

# Example 2
df = pd.read_csv('https://raw.githubusercontent.com/ash322ash422/tut_pandas_numpy/refs/heads/master/titanic.csv', sep=',')
print(df.head(5))
print("0000000000000")

print(df['SibSp'].unique())  # array of unique values
print("111111111111111")
print(df['SibSp'].value_counts())  # count per unique value

       Name  Age         City  Experience  Experience2   Salary passport
0     Alice   25      Chicago           2            2  70000.0      a43
1       Bob   30  Los Angeles           5            5  80000.0      a44
2   Charlie   35      Chicago           7            7      NaN      a45
3     David   40      Houston          10           10  90000.0      a46
4       Eva   22      Houston           1            1  48000.0      a47
5     Frank   28          NaN           3            3  72000.0      a48
6     Grace   32  San Antonio           6            6  85000.0      a49
7     Helen   26    San Diego           2            2  62000.0      a50
8     Helen   26    San Diego           2            2  62000.0      a51
9     Helen   26    San Diego           2            2  62000.0      a52
10    Jerry   23      Phoenix           6            6  78000.0      a53
0000000000000
['Chicago' 'Los Angeles' 'Houston' nan 'San Antonio' 'San Diego' 'Phoenix']
111111111111111
City
San Diego    
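
value_counts() leaves out NaN by default, while unique() keeps it. The sketch below, on a made-up Series of city names, shows the parameters that change this, plus nunique():

```python
import pandas as pd

# Hypothetical city values with one missing entry
city = pd.Series(["Chicago", "Houston", "Chicago", None, "Houston", "Chicago"])

counts = city.value_counts()                       # NaN is left out by default
counts_with_nan = city.value_counts(dropna=False)  # NaN gets its own row
shares = city.value_counts(normalize=True)         # fractions of the non-missing values
n_unique = city.nunique()                          # number of distinct non-missing values

print(counts_with_nan)
print(shares)
```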

In [22]:
# Goal: correlation coefficients between the numeric features of the tips dataset.
# Values range from -1 (perfect negative) through 0 (no linear relationship) to +1 (perfect positive).
import seaborn as sns
df = sns.load_dataset("tips")  # "titanic" is another built-in option

numeric_cols = df.select_dtypes(include='number').columns
print(df[numeric_cols].corr())

            total_bill       tip      size
total_bill    1.000000  0.675734  0.598315
tip           0.675734  1.000000  0.489299
size          0.598315  0.489299  1.000000
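
To see what the extreme values mean, here is a sketch with made-up columns that are exact multiples of each other. It also uses `numeric_only=True` as an alternative to selecting the numeric columns first:

```python
import pandas as pd

# Hypothetical data
df = pd.DataFrame({
    "x": [1, 2, 3, 4],
    "double_x": [2, 4, 6, 8],       # rises exactly with x
    "minus_x": [-1, -2, -3, -4],    # falls exactly as x rises
    "label": ["a", "b", "c", "d"],  # non-numeric, skipped below
})

# numeric_only=True drops the string column without a separate select_dtypes step
corr = df.corr(numeric_only=True)

# Correlation of a single pair of columns
pair = df["x"].corr(df["minus_x"])

print(corr)
print(pair)
```

In the tips output above, the value of about 0.68 between total_bill and tip indicates a fairly strong positive linear relationship.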
