We will learn a little bit more about pandas today!

Okay! Let us look at **.loc[]** and **.iloc[]**.

In [None]:
import pandas as pd

In [None]:
# df = pd.read_csv("titanic.csv")
df = pd.read_csv("https://raw.githubusercontent.com/pandas-dev/pandas/main/doc/data/titanic.csv")
df.head()

**.loc[]** is label based indexing while **.iloc[]** is integer position-based indexing.

- **.loc[]**
    - **.loc[]** is used for selecting rows and columns by labels (names).
    - Inclusive selection: When using slicing (:), both start and end labels are included.
    - Works with boolean indexing.
- **.iloc[]**
  - Used for selecting rows and columns by their integer position.
  - Exclusive selection: Standard Python slicing rules apply (end index is excluded).
  - Does not accept labels. It accepts a plain boolean array, but not a boolean Series (such as the result of df['Pclass'] >= 2).
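
The rules above can be sketched on a tiny frame. The `toy` DataFrame and its string labels are made up for illustration; string labels are used so that labels and positions cannot be confused.

```python
import pandas as pd

# Hypothetical toy frame: labels are strings, positions are 0..3
toy = pd.DataFrame({"score": [10, 20, 30, 40]}, index=["a", "b", "c", "d"])

by_label = toy.loc["b":"c"]    # label slice: both "b" and "c" are included
by_position = toy.iloc[1:2]    # position slice: position 2 is excluded

mask = toy["score"] > 20                 # boolean Series aligned on the index
with_loc = toy.loc[mask]                 # .loc takes the boolean Series directly
with_iloc = toy.iloc[mask.to_numpy()]    # .iloc needs a plain boolean array

print(by_label)
print(by_position)
print(with_loc.equals(with_iloc))
```

The label slice returns two rows while the position slice with the "same" bounds returns one, which is the inclusive versus exclusive difference in practice.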

In [None]:
df.loc[1]

Example with boolean indexing

In [None]:
df['Pclass'] >= 2

In [None]:
df.loc[df['Pclass'] >= 2]

Let's look at **.iloc[]**.... Let's try to access the 0th index location.

In [None]:
df.iloc[0]

Let's look at the difference between exclusive and inclusive range selection.

In [None]:
df.iloc[10:14]

In [None]:
df.loc[10:14]

In [None]:
df.iloc[1:4]

In [None]:
df.loc[[1,3,4,5,6]]

In [None]:
df.iloc[[1,3,4,5,6]]

In [None]:
df.loc[[1,3,4,5,6], ['Name','Sex']]

However, this does not work with **.iloc[]**: it needs column positions, not column labels, so the next cell raises an error.

In [None]:
df.iloc[[1,3,4,5,6], ['Name','Sex']]

In [None]:
df.iloc[100:110, :3]

You can also update and modify certain values by using the ***=*** operator.

In [None]:
df.loc[1]

In [None]:
df['Pclass'] == 1

Here we look at how we can change some specific locations of a dataframe using a combination of **.loc** and broadcasting.

In [None]:
df.loc[df['Pclass'] == 1, ['Cabin']]

In [None]:
df.loc[df['Pclass'] == 1, ['Cabin']] = "rr"

Here, we changed the values of all the locations of 'Cabin' to 'rr' where the 'Pclass' value is equal to 1

In [None]:
df.loc[df['Pclass'] == 1, ['Cabin']]

This updates everywhere with Pclass = 1

##### When NOT to Use .loc and .iloc

- Avoid using **.loc** when your DataFrame index is not unique or meaningful—this can lead to unexpected results.
- Avoid using **.iloc** if you expect the index to change frequently, as positional indexing can become incorrect.
- Avoid using **.iloc** or **.loc** with large datasets in a loop—vectorized operations are faster.
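
A small sketch of the first and third points, using a made-up frame `dup` with a repeated index label. The column name `doubled` is only for illustration.

```python
import pandas as pd

# Hypothetical frame whose index label "x" appears twice
dup = pd.DataFrame({"value": [1, 2, 3]}, index=["x", "x", "y"])

rows_for_x = dup.loc["x"]   # two rows come back (a DataFrame), not one
row_for_y = dup.loc["y"]    # a unique label gives a single row (a Series)

# Loop version: one .loc lookup and one assignment per row
looped = dup.reset_index(drop=True)
looped["doubled"] = 0
for i in looped.index:
    looped.loc[i, "doubled"] = looped.loc[i, "value"] * 2

# Vectorized version: one operation over the whole column
vectorized = dup.reset_index(drop=True)
vectorized["doubled"] = vectorized["value"] * 2

print(type(rows_for_x).__name__, type(row_for_y).__name__)
print(looped["doubled"].tolist() == vectorized["doubled"].tolist())
```

Both versions produce the same column, but the loop does one pandas lookup per row, which gets slow as the frame grows.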

Now let us move on to how to handle missing values.

In [None]:
df = pd.read_csv("titanic.csv")
df.head()

In [None]:
# A notebook cell only displays its last expression, so print the first result explicitly
print(df.nunique())
df.info()

In [None]:
df['Cabin'].unique()

In [None]:
df.isnull()

In [None]:
df.isnull().sum()

In [None]:
df.info()

In [None]:
!pip install missingno

We can use the **missingno** library to **visualize** missing data in a DataFrame.

The matrix function of missingno generates a visual representation of the missing values in your DataFrame. Each column in the matrix corresponds to a column in the DataFrame, and each row represents an observation (data point). Present values are drawn as filled cells, and missing values show up as white gaps in the column.

This helps in quickly identifying the patterns of missingness in your data and can guide you in deciding how to handle the missing values.

In [None]:
import missingno as msno
msno.matrix(df)

The sparkline at right summarizes the general shape of the data completeness and points out the rows with the maximum and minimum nullity in the dataset.

We can also create a dendrogram using the missingno library.

A dendrogram is a tree diagram.

The dendrogram in **missingno** uses a hierarchical clustering algorithm to bin variables against one another by their nullity correlation (measured in terms of binary distance). At each step of the tree the variables are split up based on which combination minimizes the distance of the remaining clusters.

In [None]:
msno.dendrogram(df)

The missingno correlation heatmap measures nullity correlation: how strongly the presence or absence of one variable affects the presence of another:

In [None]:
msno.heatmap(df)

Nullity correlation ranges from -1 (if one variable appears the other definitely does not) to 0 (variables appearing or not appearing have no effect on one another) to 1 (if one variable appears the other definitely also does).
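
To make the -1 / 0 / 1 scale concrete, here is a rough sketch that computes a nullity correlation by hand on a made-up frame. It assumes the correlation is taken over the True/False missingness indicators; treat it as an illustration of the idea rather than the exact code missingno runs.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with deliberately arranged missing values
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan],
    "b": [1.0, np.nan, 3.0, np.nan],   # missing in exactly the same rows as "a"
    "c": [np.nan, 2.0, np.nan, 4.0],   # missing exactly where "a" is present
})

nullity = toy.isnull().astype(int)   # 1 where a value is missing, 0 otherwise
nullity_corr = nullity.corr()        # pairwise correlation of the indicators

print(nullity_corr)
```

Here "a" and "b" get a correlation of 1 because they are always missing together, while "a" and "c" get -1 because one is missing exactly when the other is present.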

In [None]:
msno.bar(df)

In [None]:
df2 = df.copy()
df2.info()

Okay. Now let us look at what we can do about these missing values.

The easiest thing is to drop rows that have missing values.

In [None]:
df3 = df2.dropna()

This method removes rows (by default) with missing (NaN) values.

In [None]:
df3.info()

In [None]:
df2.dropna(inplace=True)

Now let us look at how to drop rows or columns based on the missing values.

In [None]:
df_drop_rows_with_na = df.copy()
df_drop_cols_with_na = df.copy()
df_drop_all = df.copy()
 
df_drop_cols_with_na.dropna(inplace=True, axis=1)
df_drop_rows_with_na.dropna(inplace=True, axis=0)
df_drop_all.dropna(inplace=True, how='all')

df_drop_cols_with_na.info()
df_drop_rows_with_na.info()
df_drop_all.info()
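
The effect of each option is easier to see on a frame small enough to read at a glance. The sketch below uses a made-up three-row frame; `subset` and `thresh` are two further `dropna` parameters that were not used above.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: row 0 is complete, row 1 misses "a", row 2 misses "a" and "b"
toy = pd.DataFrame({
    "a": [1.0, np.nan, np.nan],
    "b": [4.0, 5.0, np.nan],
    "c": [7.0, 8.0, 9.0],
})

rows_any = toy.dropna(axis=0)            # keep only rows with no missing values
cols_any = toy.dropna(axis=1)            # keep only columns with no missing values
rows_all = toy.dropna(how="all")         # drop only rows where every value is missing
rows_subset = toy.dropna(subset=["b"])   # only check column "b" for missing values
rows_thresh = toy.dropna(thresh=2)       # keep rows with at least 2 non-missing values

print(rows_any.index.tolist(), cols_any.columns.tolist(), len(rows_all))
print(rows_subset.index.tolist(), rows_thresh.index.tolist())
```

Note that `how="all"` drops nothing here, because no row is missing in every column.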

We can even fill the Null values with something else.

In [None]:
df_fill_with_mean = df.copy()
df_fill_with_mean.head()
df_fill_with_mean.isnull().sum()
age_mean = df_fill_with_mean['Age'].mean()
age_mean

In [None]:
df_fill_with_mean['Age na filled with mean'] = df_fill_with_mean['Age'].fillna(age_mean)
df_fill_with_mean.info()

In [None]:
df_fill_with_mean.head()

You can also use something like forward filling or backward filling.

In [None]:
df_fill_with_fbfilling = df.copy()
# Note: fillna(method=...) is deprecated since pandas 2.1; the next cell shows the preferred .ffill()/.bfill()
df_fill_with_fbfilling['Age na filled with forward filling'] = df_fill_with_fbfilling['Age'].fillna(method='ffill')
df_fill_with_fbfilling['Age na filled with backward filling'] = df_fill_with_fbfilling['Age'].fillna(method='bfill')
df_fill_with_fbfilling.info()

In [None]:
df_fill_with_fbfilling = df.copy()
df_fill_with_fbfilling['Age na filled with forward filling'] = df_fill_with_fbfilling['Age'].ffill()
df_fill_with_fbfilling['Age na filled with backward filling'] = df_fill_with_fbfilling['Age'].bfill()
df_fill_with_fbfilling.info()
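
What the two fills actually do is clearer on a short Series. The `ages` values below are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical ages with gaps in the middle and at the end
ages = pd.Series([22.0, np.nan, np.nan, 35.0, np.nan])

forward = ages.ffill()    # carry the last seen value forward into each gap
backward = ages.bfill()   # pull the next seen value backward into each gap

print(forward.tolist())
print(backward.tolist())
```

The last value stays missing after backward filling because there is no later value to copy from. In the same way, a leading gap would survive forward filling.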

Now let's move to categorical data.

We can replace missing values with the most frequent value. **(Mode Imputation)**

In [None]:
df_fill_categorical_data = df.copy()
most_frequent = df_fill_categorical_data["Cabin"].value_counts()
print(most_frequent.index[0])

In [None]:
df_fill_categorical_data = df.copy()
most_frequent = df_fill_categorical_data["Cabin"].value_counts().index[0]
most_frequent

In [None]:
df_fill_categorical_data['cabin na filled with most freq val'] = df_fill_categorical_data['Cabin'].fillna(most_frequent)
df_fill_categorical_data.info()
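
As an alternative to `value_counts().index[0]`, the most frequent value can also be read from `Series.mode()`. This is a sketch on made-up cabin labels.

```python
import numpy as np
import pandas as pd

# Hypothetical cabin labels with two missing entries
cabins = pd.Series(["C85", np.nan, "E46", "C85", np.nan])

# mode() ignores NaN by default and returns a Series, since ties are possible
most_common = cabins.mode()[0]
filled = cabins.fillna(most_common)

print(most_common)
print(filled.tolist())
```

When several values tie for most frequent, `mode()` returns all of them in sorted order, so taking `[0]` picks one arbitrarily.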

Let's look at how we can concatenate two dataframes.

In [None]:
df2.info()
df4 = pd.concat([df2, df2])
df4.info()

Now let's remove duplicate rows from a dataframe.

In [None]:
df4 = df4.drop_duplicates()
# you can pass inplace = True to change the dataframe inplace
# df4.drop_duplicates(inplace=True)
print(df4.shape)
df4.info()
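
One detail worth seeing on a tiny example: `pd.concat` keeps the original index labels, so they repeat, and `drop_duplicates` compares column values rather than the index. The frame `small` below is made up for illustration.

```python
import pandas as pd

# Hypothetical two-row frame stacked on top of itself
small = pd.DataFrame({"id": [1, 2], "name": ["a", "b"]})

stacked = pd.concat([small, small])                         # index labels repeat: 0, 1, 0, 1
renumbered = pd.concat([small, small], ignore_index=True)   # fresh index 0..3
deduped = stacked.drop_duplicates()                         # keeps the first copy of each row

print(stacked.index.tolist())
print(renumbered.index.tolist())
print(len(deduped))
```

Passing `ignore_index=True` is a simple way to avoid the repeated labels, which matters if you later select rows with **.loc[]**.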


#### Data analysis in pandas

##### Summary Operators

In [None]:
df2.info()
df2.select_dtypes(include='number').mean()

Pandas has built-in statistical functions as well.

In [None]:
df2.mean()

This does not work because we have non-numeric columns, so we pick only the numeric columns and calculate the mean.

In [None]:
df2.mean(numeric_only=True)

In [None]:
df2.mode()

This is a nice way to exclude a set of columns.

In [None]:
df2[df2.columns.difference(['PassengerId', 'Name'])].mode()

In [None]:
df2[["Survived", "Pclass"]].mode()

In [None]:
df2.median(numeric_only=True)

Let's try to create a new column based on the existing columns.

In [None]:
df2['age_to_fare_ratio'] = df2['Age']/df2['Fare']
df2.head()

Let's look at how to do value counts

In [None]:
df2['Sex'].value_counts()

Adding the normalize argument returns proportions instead of absolute counts.

In [None]:
df2['Sex'].value_counts(normalize=True)

In [None]:
df.value_counts(subset=['Sex', 'Embarked'])

- It groups the DataFrame by unique combinations of Sex and Embarked values.

- It counts how many times each unique combination appears.
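
The two bullet points above can be checked by comparing against an explicit group-by. This sketch uses a made-up five-row frame, `people`.

```python
import pandas as pd

# Hypothetical passengers
people = pd.DataFrame({
    "Sex": ["male", "female", "female", "male", "female"],
    "Embarked": ["S", "C", "C", "S", "S"],
})

counts = people.value_counts(subset=["Sex", "Embarked"])   # sorted by count, largest first
sizes = people.groupby(["Sex", "Embarked"]).size()         # sorted by group key

print(counts)
print(sizes)
```

Both hold the same numbers. The difference is the ordering: `value_counts` sorts by count, while `groupby(...).size()` sorts by the group keys.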


##### Aggregating with group by in pandas

Here I am going to group the rows by the column **Sex** and look at aggregate values.

Imagine you want to find what fraction of men and women survived on the Titanic. Since **Survived** is 0 or 1, its mean per group is the survival rate.

In [None]:
df2.groupby('Sex').mean(numeric_only=True)

Note that you have to use an aggregate method when you use groupby.
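
If you want several aggregates at once, for example the number of survivors alongside the survival rate, one option is `agg` with a list of function names. This is a sketch on made-up data, not on the Titanic frame.

```python
import pandas as pd

# Hypothetical passengers: Survived is 1 for yes, 0 for no
passengers = pd.DataFrame({
    "Sex": ["male", "female", "female", "male", "female"],
    "Survived": [0, 1, 1, 1, 0],
})

# sum = number of survivors, mean = survival rate, count = group size
summary = passengers.groupby("Sex")["Survived"].agg(["sum", "mean", "count"])

print(summary)
```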

Let's try to do one-hot encoding. We will learn to use pandas and scikit-learn to do this.
So first install scikit-learn if you have not done so.

In [None]:
!pip install scikit-learn

In [None]:
hard_coded_data = pd.DataFrame([
    [10, 'M', 'Good'],
    [20, 'F', 'Nice'],
    [15, 'F', 'Good'],
    [25, 'M', 'Great'],
    [30, 'F', 'Nice'],
])
hard_coded_data.columns=['Employee id', 'Gender', 'Remarks']
print(hard_coded_data)

In [None]:
df_encoded_pandas = pd.get_dummies(hard_coded_data, columns=['Gender', 'Remarks'])
df_encoded_pandas

In [None]:
df_encoded_pandas_first_drop = pd.get_dummies(hard_coded_data, columns=['Gender', 'Remarks'], drop_first=True)
df_encoded_pandas_first_drop

We drop the first category because one of the columns created by one-hot encoding can always be predicted from the rest. This leads to multicollinearity, where one column can be perfectly predicted by the others. It can be problematic in certain models, especially linear models like regression.
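
The redundancy is easy to verify on a single two-category column. In this sketch (made-up data, with `dtype=int` so the dummies are 0/1), the dropped column can be rebuilt exactly from the one that is kept.

```python
import pandas as pd

# Hypothetical single categorical column
genders = pd.DataFrame({"Gender": ["M", "F", "F", "M", "F"]})

full = pd.get_dummies(genders, columns=["Gender"], dtype=int)
reduced = pd.get_dummies(genders, columns=["Gender"], drop_first=True, dtype=int)

# Gender_F carries no extra information: it is always 1 - Gender_M
rebuilt_f = 1 - full["Gender_M"]

print(full.columns.tolist(), reduced.columns.tolist())
print((rebuilt_f == full["Gender_F"]).all())
```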

In [None]:
type(df_encoded_pandas)

Let's look at how we can do this using scikit-learn.

In [None]:
from sklearn.preprocessing import OneHotEncoder

categorical_columns = ['Gender', 'Remarks']
# Initialize OneHotEncoder
encoder = OneHotEncoder(sparse_output=False)
print(hard_coded_data[categorical_columns])
# Fit and transform the categorical columns

one_hot_encoded_sci = encoder.fit_transform(hard_coded_data[categorical_columns])
one_hot_encoded_sci

In [None]:
# Create a DataFrame with the encoded columns
one_hot_df = pd.DataFrame(one_hot_encoded_sci, 
                          columns=encoder.get_feature_names_out(categorical_columns))
one_hot_df

Now we drop the original categorical columns and attach the new one-hot encoded columns.

In [None]:
# Concatenate the one-hot encoded columns with the original DataFrame
hard_coded_data_with_one_hot_encoded = pd.concat([hard_coded_data.drop(categorical_columns, axis=1), one_hot_df], axis=1)
hard_coded_data_with_one_hot_encoded