# Pandas Basics for Machine Learning

## Section 0: Introduction to Pandas and Its Importance for Machine Learning

### Why Pandas is Important for Machine Learning

The name "Pandas" is derived from the term "Panel Data", an econometrics term for multidimensional structured data sets. Pandas is a powerful and flexible Python library used for data manipulation and analysis. It is essential for machine learning for several reasons:

#### 1. Handling Tabular Data
- **Tabular Data Management**: Machine learning often involves working with structured, tabular data. Pandas provides efficient data structures like DataFrames and Series to handle such data. These structures are intuitive and allow for complex operations with simple syntax.
- **Data Importing**: Pandas can read data from various file formats such as CSV, Excel, SQL databases, and more, making it easy to import datasets for analysis and modeling.

#### 2. Data Cleaning and Preparation
- **Handling Missing Data**: Real-world data is often messy and contains missing values. Pandas provides straightforward methods to detect, handle, and fill missing data, which is crucial for preparing clean datasets for machine learning models.
- **Filtering and Sorting**: Pandas allows for easy filtering, sorting, and subsetting of data based on specific criteria, helping to prepare and clean data efficiently.

#### 3. Data Transformation
- **Feature Engineering**: Creating new features from existing data is a key step in improving model performance. Pandas offers powerful tools for manipulating and transforming data, enabling effective feature engineering.
- **Merging and Joining**: Combining multiple datasets is a common task in data analysis. Pandas provides robust methods for merging, joining, and concatenating datasets, facilitating comprehensive data preparation.

#### 4. Exploratory Data Analysis (EDA)
- **Descriptive Statistics**: Pandas allows for quick calculation of summary statistics, such as mean, median, standard deviation, etc., providing insights into the data distribution and helping in identifying patterns and anomalies.
- **Data Visualization**: Although not a visualization library, Pandas integrates well with libraries like Matplotlib and Seaborn, enabling easy creation of plots and charts for data exploration.

#### 5. Integration with Machine Learning Libraries
- **Seamless Integration & DataFrame Compatibility**: Many machine learning functions and models in libraries like Scikit-learn accept Pandas DataFrames as input, making it convenient to directly use preprocessed data for training and evaluation.

#### Conclusion
Pandas is the backbone of data manipulation and preparation in the machine learning workflow. Its rich functionality and ease of use make it a standard tool for data scientists and machine learning practitioners. Mastering Pandas streamlines data handling, and cleaner, better-prepared data generally leads to more reliable models.


## Section 1: Getting Started with Pandas

In this section, we'll cover the basics of getting started with pandas, including how to import the library and understand its core data structures: Series and DataFrame.

In [1]:
import pandas as pd
import numpy as np
# Display version
print(pd.__version__)

# Creating a Series
s = pd.Series([1, 3, 5, np.nan, 6, 8])
print(s)

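# Creating a DataFrame from a NumPy array, with a date index and labeled columns
# (the values are random and unseeded, so they will differ on every run)
dates = pd.date_range('20230101', periods=6)
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
print(df)

# Creating a DataFrame from a dict: keys become column names
df = pd.DataFrame(
    {"a": [4, 5, 6],
     "b": [7, 8, 9],
     "c": [10, 11, 12]},
    index=[1, 2, 3])
print(df)

# Creating the same DataFrame from a list of rows
df = pd.DataFrame(
    [[4, 7, 10],
     [5, 8, 11],
     [6, 9, 12]],
    index=[1, 2, 3],
    columns=['a', 'b', 'c'])
print(df)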

2.2.2
0    1.0
1    3.0
2    5.0
3    NaN
4    6.0
5    8.0
dtype: float64
                   A         B         C         D
2023-01-01 -0.658230 -1.474535 -0.013956  0.167755
2023-01-02 -0.018891 -0.536357  1.415256 -0.269086
2023-01-03  0.759182  0.022507  0.286455  0.289033
2023-01-04  0.429143 -0.058194 -1.339836  0.861946
2023-01-05  0.044058 -1.576313  1.738802 -1.365960
2023-01-06  1.341321  0.787794  1.183616  1.001367
   a  b   c
1  4  7  10
2  5  8  11
3  6  9  12
   a  b   c
1  4  7  10
2  5  8  11
3  6  9  12


In [2]:
df = pd.DataFrame(
    {"a" : [4 ,5, 6],
    "b" : [7, 8, 9],
    "c" : [10, 11, 12]},
    index = pd.MultiIndex.from_tuples(
    [('d', 1), ('d', 2),
    ('e', 2)], names=['n', 'v'])
    )
print(df)

     a  b   c
n v          
d 1  4  7  10
  2  5  8  11
e 2  6  9  12
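
With a MultiIndex in place, rows can be selected by one or both index levels. The following is a minimal sketch that rebuilds the same frame as above; the names `df_mi`, `rows_d`, `row_d2`, and `v2` are only illustrative.

```python
import pandas as pd

df_mi = pd.DataFrame(
    {"a": [4, 5, 6], "b": [7, 8, 9], "c": [10, 11, 12]},
    index=pd.MultiIndex.from_tuples(
        [("d", 1), ("d", 2), ("e", 2)], names=["n", "v"]
    ),
)

# All rows whose outer level 'n' equals 'd' (the result is indexed by 'v' only)
rows_d = df_mi.loc["d"]

# A single row addressed by both levels; the result is a Series
row_d2 = df_mi.loc[("d", 2)]

# Cross-section on the inner level: every row with v == 2
v2 = df_mi.xs(2, level="v")

print(rows_d)
print(row_d2)
print(v2)
```

`loc` with a tuple addresses the levels from the outside in, while `xs` is convenient when the level you care about is not the outermost one.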


## Section 2: Loading Data with Pandas
In this section, we'll learn how to load data into a Pandas DataFrame. We'll use the `pd.read_csv()` function to read a CSV file.

In [3]:
import pandas as pd

# Load a sample dataset
url = 'https://raw.githubusercontent.com/mwaskom/seaborn-data/master/titanic.csv'
df = pd.read_csv(url) # read_csv can be replaced with read_excel, read_json, etc.

df.head(10)

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.925,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.05,S,Third,man,True,,Southampton,no,True
5,0,3,male,,0,0,8.4583,Q,Third,man,True,,Queenstown,no,True
6,0,1,male,54.0,0,0,51.8625,S,First,man,True,E,Southampton,no,True
7,0,3,male,2.0,3,1,21.075,S,Third,child,False,,Southampton,no,False
8,1,3,female,27.0,0,2,11.1333,S,Third,woman,False,,Southampton,yes,False
9,1,2,female,14.0,1,0,30.0708,C,Second,child,False,,Cherbourg,yes,False
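
`pd.read_csv()` also accepts parameters that shape the data while it is being loaded. The sketch below uses a small made-up CSV string (not the Titanic file) wrapped in `io.StringIO`, so it runs without a network connection; the column names and the `?` placeholder are assumptions for illustration.

```python
import io
import pandas as pd

# Made-up CSV text standing in for a file on disk or at a URL
csv_text = """id,age,fare,embarked
1,22,7.25,S
2,?,71.28,C
3,26,,S
"""

sample = pd.read_csv(
    io.StringIO(csv_text),
    usecols=["id", "age", "fare"],  # load only the columns you need
    na_values=["?"],                # treat '?' as missing, in addition to the defaults
    index_col="id",                 # use the 'id' column as the row index
)

print(sample)
print(sample.dtypes)
```

Because the `?` is converted to `NaN` at load time, the `age` column is parsed as a numeric column rather than as text.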


## Section 3: Exploring the Data
In this section, we'll learn how to explore our dataset. We'll use various functions to understand the structure and summary statistics of the data.

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 15 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   survived     891 non-null    int64  
 1   pclass       891 non-null    int64  
 2   sex          891 non-null    object 
 3   age          714 non-null    float64
 4   sibsp        891 non-null    int64  
 5   parch        891 non-null    int64  
 6   fare         891 non-null    float64
 7   embarked     889 non-null    object 
 8   class        891 non-null    object 
 9   who          891 non-null    object 
 10  adult_male   891 non-null    bool   
 11  deck         203 non-null    object 
 12  embark_town  889 non-null    object 
 13  alive        891 non-null    object 
 14  alone        891 non-null    bool   
dtypes: bool(2), float64(2), int64(4), object(7)
memory usage: 92.4+ KB


In [5]:
df.describe()

Unnamed: 0,survived,pclass,age,sibsp,parch,fare
count,891.0,891.0,714.0,891.0,891.0,891.0
mean,0.383838,2.308642,29.699118,0.523008,0.381594,32.204208
std,0.486592,0.836071,14.526497,1.102743,0.806057,49.693429
min,0.0,1.0,0.42,0.0,0.0,0.0
25%,0.0,2.0,20.125,0.0,0.0,7.9104
50%,0.0,3.0,28.0,0.0,0.0,14.4542
75%,1.0,3.0,38.0,1.0,0.0,31.0
max,1.0,3.0,80.0,8.0,6.0,512.3292


In [6]:
df.columns

Index(['survived', 'pclass', 'sex', 'age', 'sibsp', 'parch', 'fare',
       'embarked', 'class', 'who', 'adult_male', 'deck', 'embark_town',
       'alive', 'alone'],
      dtype='object')
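
A few more calls are often useful at this stage. This is a minimal sketch on a small made-up frame (`toy` is hypothetical and only mimics the kinds of columns found in the Titanic data).

```python
import pandas as pd

# Small made-up frame with the same kinds of columns as the Titanic data
toy = pd.DataFrame(
    {
        "survived": [0, 1, 1, 0, 1],
        "sex": ["male", "female", "female", "male", "female"],
        "age": [22.0, 38.0, None, 35.0, 27.0],
    }
)

print(toy.shape)                  # (rows, columns)
print(toy["sex"].value_counts())  # frequency of each category
print(toy["sex"].nunique())       # number of distinct values
print(toy["survived"].mean())     # share of 1s in a 0/1 column

# Names of the numeric columns, handy before calling numeric-only methods
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
print(numeric_cols)
```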

## Section 4: Sorting & Filtering
In this section, we'll learn how to sort and filter data in a DataFrame. We'll use the `sort_values()` method to sort data and boolean indexing to filter rows; `loc` and `iloc` offer label-based and position-based selection for the same purpose.

In [7]:
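# Sort rows by age, youngest first (rows with a missing age go last by default)
df_sorted = df.sort_values(by='age')

df_sorted.head()  # not displayed: a notebook cell only shows its last expression

# Keep only passengers older than 30
df_filtered = df[df['age'] > 30]

df_filtered.head()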

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
3,1,1,female,35.0,1,0,53.1,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.05,S,Third,man,True,,Southampton,no,True
6,0,1,male,54.0,0,0,51.8625,S,First,man,True,E,Southampton,no,True
11,1,1,female,58.0,0,0,26.55,S,First,woman,False,C,Southampton,yes,True


In [8]:
# Combine conditions with & (and) or | (or); each condition needs its own parentheses
df_filtered_multi = df[(df['age'] > 30) & (df['sex'] == 'female')]

df_filtered_multi.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
3,1,1,female,35.0,1,0,53.1,S,First,woman,False,C,Southampton,yes,False
11,1,1,female,58.0,0,0,26.55,S,First,woman,False,C,Southampton,yes,True
15,1,2,female,55.0,0,0,16.0,S,Second,woman,False,,Southampton,yes,True
18,0,3,female,31.0,1,0,18.0,S,Third,woman,False,,Southampton,no,False
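
For comparison, here is a minimal sketch of `loc`, `iloc`, and a multi-key sort on a small made-up frame. The frame `toy` and its index labels 10 to 13 are assumptions, chosen so that labels and positions visibly differ.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "sex": ["male", "female", "female", "male"],
        "age": [22.0, 38.0, 26.0, 54.0],
        "fare": [7.25, 71.28, 7.93, 51.86],
    },
    index=[10, 11, 12, 13],
)

# loc is label-based: a boolean mask for rows, column labels for columns
older = toy.loc[toy["age"] > 30, ["sex", "fare"]]

# iloc is position-based: first two rows, first two columns
top_left = toy.iloc[:2, :2]

# Sort by several keys with a different direction for each
ordered = toy.sort_values(by=["sex", "age"], ascending=[True, False])

print(older)
print(top_left)
print(ordered)
```

Using `loc` with a mask and a column list selects rows and columns in one step, which avoids chained indexing.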


## Section 5: Data Cleaning

Data cleaning is a crucial step in preparing data for analysis. This section covers handling missing data and removing duplicates.

In [10]:
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer
from sklearn.preprocessing import OneHotEncoder

# Load the dataset
url = 'https://raw.githubusercontent.com/mwaskom/seaborn-data/master/titanic.csv'
df = pd.read_csv(url)
df

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.2500,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.9250,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1000,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.0500,S,Third,man,True,,Southampton,no,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
886,0,2,male,27.0,0,0,13.0000,S,Second,man,True,,Southampton,no,True
887,1,1,female,19.0,0,0,30.0000,S,First,woman,False,B,Southampton,yes,True
888,0,3,female,,1,2,23.4500,S,Third,woman,False,,Southampton,no,False
889,1,1,male,26.0,0,0,30.0000,C,First,man,True,C,Cherbourg,yes,True


In [17]:
# Handling missing data
print("Missing values in each column:\n", df.isnull().sum())

Missing values in each column:
 survived         0
pclass           0
sex              0
age            177
sibsp            0
parch            0
fare             0
embarked         2
class            0
who              0
adult_male       0
deck           688
embark_town      2
alive            0
alone            0
dtype: int64


In [18]:
# Drop rows with missing data
df_dropped = df.dropna()

df_dropped

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
3,1,1,female,35.0,1,0,53.1000,S,First,woman,False,C,Southampton,yes,False
6,0,1,male,54.0,0,0,51.8625,S,First,man,True,E,Southampton,no,True
10,1,3,female,4.0,1,1,16.7000,S,Third,child,False,G,Southampton,yes,False
11,1,1,female,58.0,0,0,26.5500,S,First,woman,False,C,Southampton,yes,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
871,1,1,female,47.0,1,1,52.5542,S,First,woman,False,D,Southampton,yes,False
872,0,1,male,33.0,0,0,5.0000,S,First,man,True,B,Southampton,no,True
879,1,1,female,56.0,0,1,83.1583,C,First,woman,False,C,Cherbourg,yes,False
887,1,1,female,19.0,0,0,30.0000,S,First,woman,False,B,Southampton,yes,True


In [19]:
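# Fill missing data with a specified value.
# Note: 0 is written into every column, including text columns such as 'deck',
# and a filled age of 0 cannot be told apart from a real value
df_filled = df.fillna(value=0)
df_filled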

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.2500,S,Third,man,True,0,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.9250,S,Third,woman,False,0,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1000,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.0500,S,Third,man,True,0,Southampton,no,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
886,0,2,male,27.0,0,0,13.0000,S,Second,man,True,0,Southampton,no,True
887,1,1,female,19.0,0,0,30.0000,S,First,woman,False,B,Southampton,yes,True
888,0,3,female,0.0,1,2,23.4500,S,Third,woman,False,0,Southampton,no,False
889,1,1,male,26.0,0,0,30.0000,C,First,man,True,C,Cherbourg,yes,True


In [20]:
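# Fill missing data using forward fill: each gap takes the last valid value above it.
# A gap in the first row stays NaN, and the result depends on row order,
# so this suits ordered data such as time series better than this dataset
df_ffill = df.ffill()
df_ffill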

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.2500,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.9250,S,Third,woman,False,C,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1000,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.0500,S,Third,man,True,C,Southampton,no,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
886,0,2,male,27.0,0,0,13.0000,S,Second,man,True,C,Southampton,no,True
887,1,1,female,19.0,0,0,30.0000,S,First,woman,False,B,Southampton,yes,True
888,0,3,female,19.0,1,2,23.4500,S,Third,woman,False,B,Southampton,no,False
889,1,1,male,26.0,0,0,30.0000,C,First,man,True,C,Cherbourg,yes,True


In [21]:
# Fill missing data using backward fill
df_bfill = df.bfill()
df_bfill

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.2500,S,Third,man,True,C,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.9250,S,Third,woman,False,C,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1000,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.0500,S,Third,man,True,E,Southampton,no,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
886,0,2,male,27.0,0,0,13.0000,S,Second,man,True,B,Southampton,no,True
887,1,1,female,19.0,0,0,30.0000,S,First,woman,False,B,Southampton,yes,True
888,0,3,female,26.0,1,2,23.4500,S,Third,woman,False,C,Southampton,no,False
889,1,1,male,26.0,0,0,30.0000,C,First,man,True,C,Cherbourg,yes,True


In [49]:
# One-hot encode the categorical columns.
# Heuristic: treat any column with fewer than 10 unique values as categorical,
# then exclude the target column 'survived' so it is not encoded.
# Note: 'alive' is a yes/no copy of 'survived'; drop it before training a model
# to avoid leaking the target into the features.
categorical_columns = [col for col in df.columns if df[col].nunique() < 10]
categorical_columns.remove('survived')

one_hot_encoder = OneHotEncoder(sparse_output=False)
one_hot_encoded = one_hot_encoder.fit_transform(df[categorical_columns])
one_hot_encoded_df = pd.DataFrame(one_hot_encoded, columns=one_hot_encoder.get_feature_names_out(categorical_columns))

# Concatenate the one-hot encoded columns back with the original dataframe
df_encoded = pd.concat([df.drop(columns=categorical_columns), one_hot_encoded_df], axis=1)

# Fill missing data using KNN imputation
imputer = KNNImputer(n_neighbors=5)  # You can set n_neighbors to the desired value
df_knn_imputed = pd.DataFrame(imputer.fit_transform(df_encoded), columns=df_encoded.columns)
print("Missing values after KNN imputation:\n", df_knn_imputed.isnull().sum())


Missing values after KNN imputation:
 survived                   0
age                        0
fare                       0
pclass_1                   0
pclass_2                   0
pclass_3                   0
sex_female                 0
sex_male                   0
sibsp_0                    0
sibsp_1                    0
sibsp_2                    0
sibsp_3                    0
sibsp_4                    0
sibsp_5                    0
sibsp_8                    0
parch_0                    0
parch_1                    0
parch_2                    0
parch_3                    0
parch_4                    0
parch_5                    0
parch_6                    0
embarked_C                 0
embarked_Q                 0
embarked_S                 0
embarked_nan               0
class_First                0
class_Second               0
class_Third                0
who_child                  0
who_man                    0
who_woman                  0
adult_male_False           0
adult

In [38]:
# Display the imputed DataFrame
df_knn_imputed.head()

Unnamed: 0,survived,age,fare,pclass_1,pclass_2,pclass_3,sex_female,sex_male,sibsp_0,sibsp_1,...,deck_G,deck_nan,embark_town_Cherbourg,embark_town_Queenstown,embark_town_Southampton,embark_town_nan,alive_no,alive_yes,alone_False,alone_True
0,0.0,22.0,7.25,0.0,0.0,1.0,0.0,1.0,0.0,1.0,...,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0
1,1.0,38.0,71.2833,1.0,0.0,0.0,1.0,0.0,0.0,1.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0
2,1.0,26.0,7.925,0.0,0.0,1.0,1.0,0.0,1.0,0.0,...,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0
3,1.0,35.0,53.1,1.0,0.0,0.0,1.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,1.0,0.0
4,0.0,35.0,8.05,0.0,0.0,1.0,0.0,1.0,1.0,0.0,...,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0
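
Removing duplicates follows the same detect-then-drop pattern as missing values. This is a minimal sketch on made-up data (the frame `toy` and its names are hypothetical).

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "name": ["Ann", "Bob", "Ann", "Cy"],
        "age": [30, 41, 30, 25],
    }
)

# Boolean mask: True for every row that repeats an earlier row
dup_mask = toy.duplicated()
print(dup_mask.sum())

# Keep the first occurrence of each duplicated row and renumber the index
deduped = toy.drop_duplicates().reset_index(drop=True)

# Consider only some columns when deciding what counts as a duplicate,
# and keep the last occurrence instead of the first
unique_names = toy.drop_duplicates(subset=["name"], keep="last")

print(deduped)
print(unique_names)
```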


## Section 6: Groupby
In this section, we'll learn how to group data using the `groupby` function and perform aggregate operations.

In [26]:
df

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.2500,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.9250,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1000,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.0500,S,Third,man,True,,Southampton,no,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
886,0,2,male,27.0,0,0,13.0000,S,Second,man,True,,Southampton,no,True
887,1,1,female,19.0,0,0,30.0000,S,First,woman,False,B,Southampton,yes,True
888,0,3,female,,1,2,23.4500,S,Third,woman,False,,Southampton,no,False
889,1,1,male,26.0,0,0,30.0000,C,First,man,True,C,Cherbourg,yes,True


In [25]:
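# mean() only applies to numeric data; numeric_only=True skips text columns
# such as 'sex' and 'deck', which otherwise raise a TypeError in pandas 2.x
grouped = df.groupby('class').mean(numeric_only=True)
grouped

grouped_multi = df.groupby(['class', 'sex']).mean(numeric_only=True)
grouped_multi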
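
Different columns often need different aggregations. The following is a minimal sketch of named aggregation with `agg` on a small made-up frame; `toy` and the output column names `mean_age`, `survival_rate`, and `passengers` are illustrative choices.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "class": ["First", "First", "Third", "Third", "Third"],
        "sex": ["female", "male", "female", "male", "male"],
        "age": [38.0, 54.0, 26.0, 22.0, 35.0],
        "survived": [1, 0, 1, 0, 0],
    }
)

# Named aggregation: output_name=(input_column, aggregation)
summary = toy.groupby("class").agg(
    mean_age=("age", "mean"),
    survival_rate=("survived", "mean"),
    passengers=("survived", "count"),  # number of non-missing values
)
print(summary)

# Selecting a column before aggregating avoids touching non-numeric columns
by_class_sex = toy.groupby(["class", "sex"])["age"].mean()
print(by_class_sex)
```

Selecting the columns to aggregate explicitly also sidesteps errors caused by calling `mean()` on text columns.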

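## Section 7: Handling Missing Data
In this section, we revisit missing data using `isnull()`, `fillna()`, and `dropna()`, this time filling numeric gaps with each column's mean.

In [None]:
# Count missing values per column
missing_values = df.isnull().sum()
missing_values

# Fill numeric columns with their mean; numeric_only=True is needed because
# df.mean() raises a TypeError on text columns in pandas 2.x
df_filled = df.fillna(df.mean(numeric_only=True))
df_filled.head()

# Drop every row that contains at least one missing value
df_dropped = df.dropna()
df_dropped.head()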
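
A single statistic does not suit every column: medians are less sensitive to outliers than means, and text columns have no mean at all. Here is a minimal sketch of per-column filling with a dict passed to `fillna`, on made-up data (the frame `toy` is hypothetical).

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "age": [22.0, None, 26.0, 54.0],
        "embarked": ["S", "C", None, "S"],
    }
)

# Map each column to its own fill value:
# the median for a numeric column, the most frequent value for a text column
filled = toy.fillna(
    {
        "age": toy["age"].median(),
        "embarked": toy["embarked"].mode()[0],
    }
)
print(filled)
```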

## Section 8: Merging & Concatenating DataFrames
In this section, we'll learn how to merge and concatenate DataFrames using the `merge` and `concat` functions.

In [None]:
df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'], 'B': ['B0', 'B1', 'B2', 'B3'], 'key': ['K0', 'K1', 'K2', 'K3']})

df2 = pd.DataFrame({'C': ['C0', 'C1', 'C2', 'C3'], 'D': ['D0', 'D1', 'D2', 'D3'], 'key': ['K0', 'K1', 'K2', 'K3']})

merged_df = pd.merge(df1, df2, on='key')
merged_df

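# Stack rows (axis=0). df1 and df2 share only the 'key' column, so the
# columns that exist in just one frame are filled with NaN
concatenated_df = pd.concat([df1, df2], axis=0)
concatenated_df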
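
The `how` parameter of `merge` controls what happens to keys that appear in only one frame. This is a minimal sketch with two made-up frames whose keys only partly overlap; `left` and `right` are illustrative names.

```python
import pandas as pd

left = pd.DataFrame({"key": ["K0", "K1", "K2"], "A": ["A0", "A1", "A2"]})
right = pd.DataFrame({"key": ["K1", "K2", "K3"], "C": ["C1", "C2", "C3"]})

# inner (the default): keep only keys present in both frames
inner = pd.merge(left, right, on="key", how="inner")

# outer: keep every key; indicator=True adds a '_merge' column showing the source
outer = pd.merge(left, right, on="key", how="outer", indicator=True)

# left: keep every row of the left frame, filling unmatched columns with NaN
left_join = pd.merge(left, right, on="key", how="left")

# Row-wise concat of frames with identical columns; ignore_index renumbers the rows
stacked = pd.concat([left, left], axis=0, ignore_index=True)

print(inner)
print(outer)
print(left_join)
print(stacked)
```

The `_merge` column is a quick way to check how many rows failed to match before deciding which join type to keep.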

## Section 9: Working with Dates
In this section, we'll learn how to work with date and time data in Pandas. We'll use `pd.to_datetime()` and the `dt` accessor.

In [None]:
date_df = pd.DataFrame({'date': ['2023-01-01', '2023-01-02', '2023-01-03'], 'value': [10, 20, 30]})

date_df['date'] = pd.to_datetime(date_df['date'])

date_df['year'] = date_df['date'].dt.year
date_df['month'] = date_df['date'].dt.month
date_df['day'] = date_df['date'].dt.day
date_df
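
The `dt` accessor can also derive features such as the weekday or the time elapsed since a reference date. Here is a minimal sketch on made-up data; the `"not a date"` entry is included on purpose to show how `errors="coerce"` handles values that cannot be parsed.

```python
import pandas as pd

dates_toy = pd.DataFrame(
    {
        "date": ["2023-01-01", "2023-01-02", "2023-01-09", "not a date"],
        "value": [10, 20, 30, 40],
    }
)

# errors="coerce" turns unparseable strings into NaT instead of raising
dates_toy["date"] = pd.to_datetime(dates_toy["date"], errors="coerce")

# Weekday name, and a weekend flag (Monday=0 ... Sunday=6)
dates_toy["weekday"] = dates_toy["date"].dt.day_name()
dates_toy["is_weekend"] = dates_toy["date"].dt.dayofweek >= 5

# Whole days elapsed since the earliest date in the column
days_since_first = (dates_toy["date"] - dates_toy["date"].min()).dt.days

print(dates_toy)
print(days_since_first)
```

Rows whose date could not be parsed end up as `NaT`, so they can be found afterwards with `isna()` like any other missing value.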

## Section 10: Basic Data Visualization
In this section, we'll learn how to create basic visualizations using Pandas' built-in plotting functions and Matplotlib.

In [None]:
import matplotlib.pyplot as plt

df['age'].plot(kind='hist', bins=30, edgecolor='k')
plt.title('Age Distribution')
plt.xlabel('Age')
plt.ylabel('Frequency')
plt.show()
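
Aggregated results can be plotted the same way. The following is a minimal sketch of a bar chart built from a `groupby` result, using a small made-up frame rather than the Titanic data; it assumes Matplotlib is installed, and the frame `toy` is hypothetical.

```python
import pandas as pd
import matplotlib.pyplot as plt

toy = pd.DataFrame(
    {
        "class": ["First", "First", "Second", "Third", "Third", "Third"],
        "survived": [1, 1, 0, 1, 0, 0],
    }
)

# Mean of a 0/1 column per group gives the share of 1s in each group
rate_by_class = toy.groupby("class")["survived"].mean()

# Series.plot returns a Matplotlib Axes, which can be customized further
ax = rate_by_class.plot(kind="bar", edgecolor="k")
ax.set_title("Survival Rate by Class (toy data)")
ax.set_xlabel("Class")
ax.set_ylabel("Survival rate")
plt.tight_layout()
```

In a notebook the figure is rendered inline; in a script, call `plt.show()` afterwards to display it.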