---
title: 'Data Analysis'
author: 'Zachary Belgum'
institute: 'Tunghai University'
number-sections: true
toc: true
execute:
    daemon: true
format: 
    html:
         embed-resources: true
---


## Key Points

- Data loading
- Data inspection
    - Structure
    - Value
- Data cleaning
    - Removing duplicates
    - Dealing with missing data
    - Detecting outliers
- Data transformation
    - Encoding

## Load Dataset

Load the dataset from a CSV file into a DataFrame named `df` using pandas.


In [None]:
import pandas as pd
df = pd.read_csv('customer_data.csv')
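
Since `customer_data.csv` is not included here, the following is a minimal sketch of the same loading step on an in-memory CSV. The rows are made up for illustration, and the column names (`CustomerID`, `Age`, `Income`, `Gender`) are assumed from the columns used later in this document.

```python
import io

import pandas as pd

# Made-up sample rows; column names are assumed from the later sections.
csv_text = """CustomerID,Age,Income,Gender
1,25,50000,M
2,,62000,F
2,,62000,F
3,41,,M
4,38,58000,
"""

# read_csv accepts a file path or any file-like object.
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.shape)   # (rows, columns)
print(sample.dtypes)  # empty fields are read as NaN
```

Empty fields become `NaN`, which is why a numeric column with gaps is typically read as floating point.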

## Data Inspection

- Structure
    - Dimension of dataset
    - Column data type
- Value
    - Missing value
    - Outliers

### Dimensions of dataset


In [None]:
df.shape

### Column data types


In [None]:
df.info()

## Check first few rows


In [None]:
df.head()

## Check last few rows


In [None]:
df.tail()

## Basic statistics


In [None]:
df.describe()

## Round to 2 decimal places


In [None]:
df.describe().round(2)
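
The inspection checklist also lists missing values. One way to count them per column is sketched below on a small made-up frame; the `sample` data is hypothetical and only reuses the column names from this document.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one gap in each column.
sample = pd.DataFrame({
    'Age': [25, np.nan, 41, 38],
    'Income': [50000, 62000, np.nan, 58000],
    'Gender': ['M', 'F', 'M', None],
})

# isna() marks missing cells; summing counts them per column.
missing_counts = sample.isna().sum()

# The mean of the boolean mask gives the share of missing values.
missing_share = sample.isna().mean().round(2)

print(missing_counts)
print(missing_share)
```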

## Dataset


In [None]:
#| echo: false
df

## Visualization


In [None]:
import matplotlib.pyplot as plt
# Histogram of the Age column
df['Age'].hist()
plt.show()

## Correlation among columns


In [None]:
df.plot.scatter(x = 'Age', y = 'Income')
plt.show()
# import seaborn as sns
# sns.pairplot(df)
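
A scatter plot shows the relationship visually. A numeric summary may complement it, for example the Pearson correlation from `DataFrame.corr()`. The sketch below uses made-up values for `Age` and `Income`, not the actual dataset.

```python
import pandas as pd

# Hypothetical, nearly linear data for illustration only.
sample = pd.DataFrame({
    'Age': [22, 30, 38, 46, 54],
    'Income': [30000, 42000, 50000, 61000, 70000],
})

# corr() returns the pairwise Pearson correlation matrix by default.
corr_matrix = sample[['Age', 'Income']].corr()
r = corr_matrix.loc['Age', 'Income']

print(corr_matrix.round(3))
```

Values of `r` near 1 or -1 suggest a strong linear relationship, while values near 0 suggest little linear association.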

## Data Cleaning

- Removing duplicates


In [None]:
# Keep the first row for each CustomerID and drop later repeats
df = df.drop_duplicates(subset='CustomerID')
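
Before dropping duplicates it can help to see which rows would be removed. The sketch below, on made-up data, uses `duplicated()` with the same `subset` argument; by default both methods keep the first occurrence.

```python
import pandas as pd

# Hypothetical data: CustomerID 2 appears twice with different ages.
sample = pd.DataFrame({
    'CustomerID': [1, 2, 2, 3],
    'Age': [25, 33, 34, 41],
})

# True for every row whose CustomerID was already seen above it.
dup_mask = sample.duplicated(subset='CustomerID')
print(sample[dup_mask])

# Default keep='first' retains the first row per CustomerID.
deduped = sample.drop_duplicates(subset='CustomerID')
print(deduped)
```

Note that rows sharing a `CustomerID` may differ in other columns, so which copy is kept can matter.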

## Dealing with Missing Data

- Imputing


In [None]:
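# Fill missing ages with the median of the observed ages
df.loc[df['Age'].isna(), 'Age'] = df['Age'].median()
# Fill missing incomes with the mean of the observed incomes
df.loc[df['Income'].isna(), 'Income'] = df['Income'].mean()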

- Removing


In [None]:
# Drop rows where Gender is missing
df = df.dropna(subset=['Gender'])
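
The two strategies above can also be written with `fillna`, which some readers may find easier to scan. This is a sketch on made-up data, not the original dataset; the values were chosen so the median and mean are easy to verify by hand.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one gap per column.
sample = pd.DataFrame({
    'Age': [20.0, 30.0, np.nan, 100.0],
    'Income': [40000.0, np.nan, 50000.0, 60000.0],
    'Gender': ['M', 'F', None, 'F'],
})

# median() and mean() skip NaN by default.
age_median = sample['Age'].median()     # median of 20, 30, 100
income_mean = sample['Income'].mean()   # mean of 40000, 50000, 60000

# Impute: a dict fills each named column with its own value.
sample = sample.fillna({'Age': age_median, 'Income': income_mean})

# Remove: drop rows where Gender is still missing.
cleaned = sample.dropna(subset=['Gender'])
print(cleaned)
```

The median is less affected by extreme values such as the age of 100 here, which is one reason to prefer it for skewed columns.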

## Outlier Detection and Treatment


In [None]:
# IQR rule: values more than 1.5 * IQR beyond the quartiles count as outliers
Q1 = df['Income'].quantile(0.25)
Q3 = df['Income'].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

In [None]:
# Cap outliers at the bounds instead of dropping the rows
df.loc[df['Income'] > upper_bound, 'Income'] = upper_bound
df.loc[df['Income'] < lower_bound, 'Income'] = lower_bound
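
As a worked illustration of the IQR rule, the sketch below applies the same bounds to a small made-up series and caps the values with `Series.clip`, which is an alternative to the two `.loc` assignments above. The numbers are hypothetical.

```python
import pandas as pd

# Hypothetical incomes with one extreme value.
income = pd.Series([30, 40, 50, 60, 70, 80, 90, 100, 1000], name='Income')

q1 = income.quantile(0.25)     # 50.0
q3 = income.quantile(0.75)     # 90.0
iqr = q3 - q1                  # 40.0
lower_bound = q1 - 1.5 * iqr   # -10.0
upper_bound = q3 + 1.5 * iqr   # 150.0

# clip() caps values below and above the bounds in one call.
capped = income.clip(lower=lower_bound, upper=upper_bound)
print(capped.tolist())
```

Only the value 1000 lies outside the bounds, so it is capped at 150 and the rest are unchanged.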

## Data Transformation

## Before encoding

## Encoding: M->0, F->1


In [None]:
# Map each category to an integer code; unmapped values become NaN
df['Gender_encoded'] = df['Gender'].map({'M': 0, 'F': 1})
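
One caveat with `map` and a dictionary is that any value not listed in the mapping becomes `NaN`. The sketch below, on made-up data, checks for such values after encoding; the `'Other'` category is hypothetical and may not occur in the actual dataset.

```python
import pandas as pd

# Hypothetical column containing a category missing from the mapping.
gender = pd.Series(['M', 'F', 'F', 'Other', 'M'], name='Gender')
mapping = {'M': 0, 'F': 1}

encoded = gender.map(mapping)

# Values that were present before encoding but are NaN afterwards.
unmapped = gender[encoded.isna()].unique()
print(encoded.tolist())
print(unmapped)
```

If `unmapped` is not empty, the mapping may need extra entries, or those rows may need separate handling.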

## After encoding