Name : Ian Sebastian

Student ID (NIM) : 10118024

Institution : Institut Teknologi Bandung

Dataset Source : TakeMeOut

Google DSC ITB Data Science Assignment -- Data Exploration


## Preliminary Actions

### Importing Necessary Libraries

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

### Importing Dataset

In [None]:
data_src = 'https://raw.githubusercontent.com/iansebastian59/DSCDataScience/main/takemeout.csv'
df = pd.read_csv(data_src)

### Viewing the Dataset

In [None]:
df.head(4)

### View Column Names

In [None]:
# Get current column names
print(df.columns)

### Examine Dataset Datatypes

In [None]:
# df.info() prints its report directly and returns None, so no print() is needed
df.info()

### Examine Data Distribution

In [None]:
print(df.describe())
print(df.describe(include='object'))

### Takeaways:

In **bold** are actionable insights.

General:
- **Column names are too long to work with comfortably**
- No null values, so no imputation is needed

Numerical Columns
- No anomalous values

Datetime Columns
- **'Timestamp' has not been converted to datetime format**

Categorical Columns
- 'Gender' has only 2 values: 'Cowok' (male) and 'Cewek' (female), as expected.
- **'Name' carries no useful information, since every value is censored.**
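
The null and datatype takeaways above can be checked programmatically rather than by reading the `info()` output. The following is a minimal sketch on a small hypothetical frame (the `sample` data and its `score` column are invented for illustration and are not part of the TakeMeOut dataset):

```python
import pandas as pd

# Hypothetical stand-in for the survey data; all values are made up.
sample = pd.DataFrame({
    "Timestamp": ["2021-03-01 10:15:30", "2021-03-01 10:17:02", "2021-03-01 10:20:45"],
    "gender": ["Cowok", "Cewek", "Cewek"],
    "score": [4, 5, 3],
})

# Count missing values per column; all zeros means nothing needs imputing.
null_counts = sample.isna().sum()
print(null_counts)

# Check whether the timestamp column has already been parsed as datetime.
timestamp_is_datetime = pd.api.types.is_datetime64_any_dtype(sample["Timestamp"])
print("Timestamp parsed as datetime:", timestamp_is_datetime)
```

On the real data, the same two checks would be run on `df` instead of `sample`.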

## Data Cleaning


### 1. Modify Column Names

Shorten the column names to improve readability and reduce verbosity.

Original column name -- New column name
- *Timestamp* -- Timestamp (unchanged)
- *Siapa nama kamu* -- name
- *Cewek atau cowok nih?* -- gender
- *Seberapa penting quality time bareng calon pacar untuk kamu?* -- quality_time
- *Seberapa penting physical touch sama calon pacar untuk kamu?* -- phys_touch
- *Seberapa penting word of affirmation dari calon pacar untuk kamu?* -- affirmation
- *Seberapa penting dapet kado dari calon pacar untuk kamu?* -- gift
- *Seberapa penting bantuan dari calon pacar untuk kamu?* -- help

In [None]:
# Change column names
new_colnames = ['Timestamp', 'name', 'gender', 'quality_time', 'phys_touch', 'affirmation', 'gift', 'help']
df.columns = new_colnames

In [None]:
# see new column names:
df.head(3)
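
Assigning a list to `df.columns` relies on the columns being in the expected order. As an alternative, a minimal sketch of an explicit mapping with `DataFrame.rename` is shown here on a hypothetical two-column frame; it assumes the headers match the strings listed above exactly, which has not been verified against the CSV:

```python
import pandas as pd

# Hypothetical two-column stand-in; the real survey has eight columns.
sample = pd.DataFrame({
    "Siapa nama kamu": ["***"],
    "Cewek atau cowok nih?": ["Cowok"],
})

# Map each original header to its short name explicitly.
rename_map = {
    "Siapa nama kamu": "name",
    "Cewek atau cowok nih?": "gender",
}

# Fail loudly if an expected header is missing, instead of mislabeling columns.
missing = set(rename_map) - set(sample.columns)
assert not missing, f"Headers not found: {missing}"

renamed = sample.rename(columns=rename_map)
print(renamed.columns.tolist())
```

The mapping is order-independent, so a reordered export would still be renamed correctly.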

### 2. Drop Non-insightful Data

We'll drop the `name` column: every value is censored, so no information can be extracted from it.

In [None]:
# Drop the censored 'name' column; drop() returns a new DataFrame,
# which avoids chained-assignment warnings in later cells
df = df.drop(columns=['name'])
df.head(3)

### 3. Convert Datetime Column

We'll do this with the pandas `to_datetime` function.

In [None]:
df['Timestamp'] = pd.to_datetime(df['Timestamp'])
print("Datatype of Timestamp: ", df['Timestamp'].dtype)
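
If some timestamps might be malformed, `pd.to_datetime` can be asked to mark them as `NaT` instead of raising. This is a minimal sketch on hypothetical strings; the format string is an assumption for the example and may not match the format used in the actual dataset:

```python
import pandas as pd

# Hypothetical timestamp strings; the last one is deliberately malformed.
raw = pd.Series(["2021-03-01 10:15:30", "2021-03-02 08:00:00", "not a date"])

# errors="coerce" turns unparseable entries into NaT instead of raising.
parsed = pd.to_datetime(raw, format="%Y-%m-%d %H:%M:%S", errors="coerce")

# Count how many entries failed to parse so they can be inspected.
n_failed = int(parsed.isna().sum())
print("Unparseable timestamps:", n_failed)
```

A non-zero count would be a signal to inspect those rows before any time-based analysis.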

### 4. Convert Numerics to Categorical

In [None]:
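num_cols = ['quality_time', 'phys_touch', 'affirmation', 'gift', 'help']

# Treat the survey scores as categories rather than numbers, so that
# describe(include="object") reports their counts and most frequent values
df[num_cols] = df[num_cols].astype("object")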

In [None]:
# df.info() prints its report directly and returns None
df.info()

In [None]:
df.describe(include="object")
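
Because the survey scores have a natural order, an ordered categorical dtype is another option for this step. The sketch below uses hypothetical responses and assumes a 1-5 scale, which the source does not state:

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# Hypothetical responses on an assumed 1-5 scale.
scores = pd.Series([5, 3, 4, 5, 1], name="quality_time")

# An ordered dtype keeps 1 < 2 < ... < 5 and retains unused levels such as 2.
scale = CategoricalDtype(categories=[1, 2, 3, 4, 5], ordered=True)
scores_cat = scores.astype(scale)

# sort=False lists the levels in scale order rather than by frequency.
counts = scores_cat.value_counts(sort=False)
print(counts)
```

Unlike a plain `object` column, this keeps levels that no respondent chose, so every plot and table shows the full scale.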

## Data Exploration

### Descriptive Statistics

In [None]:
# Categorical Data
print(df.describe(include="object"))

### Visualization

In [None]:
%matplotlib inline
sns.set_style("ticks")

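cols = ['gender', 'quality_time', 'phys_touch', 'affirmation', 'gift', 'help']

# 2 x 3 grid: one count plot per column
fig, axs = plt.subplots(2, 3, figsize=(10, 10))

# Flatten the grid so the axes can be indexed with a single counter
axs = axs.ravel()

for (i, col) in enumerate(cols):
    # Use axs[i]; axs[i+1] skips the first panel and overruns the grid on the last column
    sns.countplot(ax=axs[i], x=col, data=df)
    axs[i].set_xlabel(col)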

_ = plt.tight_layout()


In [None]:
%matplotlib inline
_ = sns.catplot(data=df, x="quality_time", kind="count", col="gender")

In [None]:
%matplotlib inline
_ = sns.catplot(data=df, x="phys_touch", kind="count", col="gender")

In [None]:
%matplotlib inline
_ = sns.catplot(data=df, x="affirmation", kind="count", col="gender")

In [None]:
%matplotlib inline
_ = sns.catplot(data=df, x="gift", kind="count", col="gender")

In [None]:
%matplotlib inline
_ = sns.catplot(data=df, x="help", kind="count", col="gender")

### Insights:
- Men:
    - Value receiving gifts
    - Do not place much value on quality time
    - Value being helped by their partner
- Women:
    - Value being helped by their partner
    - Enjoy physical touch
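
The insights above are read off the count plots. One way to back them with numbers is to compare per-gender shares and mean scores. This is a minimal sketch on hypothetical data; the values are invented and do not reproduce the survey results:

```python
import pandas as pd

# Hypothetical responses; the values are invented for illustration only.
sample = pd.DataFrame({
    "gender": ["Cowok", "Cowok", "Cewek", "Cewek"],
    "gift": [5, 3, 2, 4],
})

# Row-normalized shares make groups of different sizes comparable.
shares = pd.crosstab(sample["gender"], sample["gift"], normalize="index")
print(shares)

# Mean score per gender as a one-number summary.
means = sample.groupby("gender")["gift"].mean()
print(means)
```

On the real data, the score columns would need to be numeric (for example via `astype(int)`) before taking the mean.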