## **Phase 1: Week 1 - Project Kick-off and Data Collection**
Objectives:
Data Collection: Retrieve the customer dataset.

Data Preprocessing: Cleanse and preprocess the dataset, addressing any missing values, outliers, or necessary data transformations.

Exploratory Data Analysis (EDA): Perform initial EDA to understand the dataset's structure and gain first insights.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

## Data Collection: Load the provided train and test datasets

In [2]:
from google.colab import drive
drive.mount("/content/drive")

Mounted at /content/drive


In [5]:
# Load the datasets
train_path = '/content/drive/My Drive/Datasets/Train.csv'
test_path = '/content/drive/My Drive/Datasets/Test.csv'

train_df = pd.read_csv(train_path)
test_df = pd.read_csv(test_path)

## Basic info

In [7]:
train_df.shape

(31647, 18)

In [8]:
test_df.shape

(13564, 17)
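
The two shapes differ by one column (18 in train, 17 in test). A quick way to see which column accounts for the difference is to compare the column sets. The sketch below is illustrative only: `compare_columns` is a hypothetical helper, and the toy frames stand in for `train_df` and `test_df` with made-up values.

```python
import pandas as pd

def compare_columns(train: pd.DataFrame, test: pd.DataFrame) -> dict:
    """Report columns that appear in only one of the two frames."""
    train_cols, test_cols = set(train.columns), set(test.columns)
    return {
        "only_in_train": sorted(train_cols - test_cols),
        "only_in_test": sorted(test_cols - train_cols),
    }

# Toy frames (made-up values) standing in for train_df / test_df
toy_train = pd.DataFrame({
    "id": ["a", "b"],
    "balance": [10.0, 20.0],
    "term_deposit_subscribed": [0, 1],
})
toy_test = pd.DataFrame({"id": ["c"], "balance": [5.0]})

column_diff = compare_columns(toy_train, toy_test)
print(column_diff)
```

On the real data the call would be `compare_columns(train_df, test_df)`; the expectation is that only the target column is missing from the test set.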

In [9]:
train_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 31647 entries, 0 to 31646
Data columns (total 18 columns):
 #   Column                            Non-Null Count  Dtype  
---  ------                            --------------  -----  
 0   id                                31647 non-null  object 
 1   customer_age                      31028 non-null  float64
 2   job_type                          31647 non-null  object 
 3   marital                           31497 non-null  object 
 4   education                         31647 non-null  object 
 5   default                           31647 non-null  object 
 6   balance                           31248 non-null  float64
 7   housing_loan                      31647 non-null  object 
 8   personal_loan                     31498 non-null  object 
 9   communication_type                31647 non-null  object 
 10  day_of_month                      31647 non-null  int64  
 11  month                             31647 non-null  object 
 12  last

In [10]:
test_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13564 entries, 0 to 13563
Data columns (total 17 columns):
 #   Column                            Non-Null Count  Dtype  
---  ------                            --------------  -----  
 0   id                                13564 non-null  object 
 1   customer_age                      13294 non-null  float64
 2   job_type                          13564 non-null  object 
 3   marital                           13483 non-null  object 
 4   education                         13564 non-null  object 
 5   default                           13564 non-null  object 
 6   balance                           13383 non-null  float64
 7   housing_loan                      13564 non-null  object 
 8   personal_loan                     13490 non-null  object 
 9   communication_type                13564 non-null  object 
 10  day_of_month                      13564 non-null  int64  
 11  month                             13564 non-null  object 
 12  last
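
The `info()` output shows gaps indirectly through the non-null counts; for example, `customer_age` has 31028 non-null values out of 31647 rows in train, so 619 are missing (roughly 2%). A small summary table makes this easier to read. The following is a minimal sketch: `missing_summary` is a hypothetical helper, and the toy frame uses made-up values.

```python
import numpy as np
import pandas as pd

def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Count and percentage of missing values, for columns that have any."""
    counts = df.isnull().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(df) * 100).round(2),
    })
    # Keep only columns with gaps, largest first
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)

# Toy frame (made-up values) standing in for train_df
toy = pd.DataFrame({
    "customer_age": [30.0, np.nan, 45.0, np.nan],
    "marital": ["single", None, "married", "married"],
    "month": ["may", "jun", "jul", "aug"],
})

toy_missing = missing_summary(toy)
print(toy_missing)
```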

In [None]:
train_df.columns

In [None]:
test_df.columns

In [None]:
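# Display unique value counts for important features.
# NOTE: this first list was written before checking the column names; several entries
# (e.g. 'job', 'housing') do not appear in the info() output above, so selecting them
# raises a KeyError. The list is corrected in a later cell.
important_features = ['job', 'marital', 'education', 'default', 'housing', 'loan', 'contact', 'month', 'day_of_week', 'poutcome']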


In [None]:
unique_value_counts_train = train_df[important_features].nunique()
unique_value_counts_test = test_df[important_features].nunique()



In [None]:
(unique_value_counts_train, unique_value_counts_test)

In [None]:
# Updated important features based on the actual column names in the datasets
updated_important_features = [
    'job_type', 'marital', 'education', 'default',
    'housing_loan', 'personal_loan', 'communication_type',
    'month', 'prev_campaign_outcome'
]

# Unique value counts for the updated important features in the train dataset
unique_value_counts_train = train_df[updated_important_features].nunique()

# Unique value counts for the updated important features in the test dataset
unique_value_counts_test = test_df[updated_important_features].nunique()

(unique_value_counts_train, unique_value_counts_test)
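
`nunique()` reports only how many distinct levels each feature has. To see how the rows are spread across those levels, relative frequencies are a natural next step. This is a sketch under assumptions: `category_shares` is a hypothetical helper, and the toy frame contains made-up values rather than the real data.

```python
import pandas as pd

def category_shares(df: pd.DataFrame, columns: list) -> dict:
    """Relative frequency of each level per column, counting missing values as a level."""
    return {
        col: df[col].value_counts(normalize=True, dropna=False)
        for col in columns
    }

# Toy frame (made-up values) standing in for train_df
toy = pd.DataFrame({
    "marital": ["married", "married", "single", None],
    "default": ["no", "no", "no", "yes"],
})

shares = category_shares(toy, ["marital", "default"])
for name, share in shares.items():
    print(name, share.to_dict())
```

On the real data, `category_shares(train_df, updated_important_features)` would give the same view for every listed feature.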

In [None]:
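# Separate handling for missing values in train and test datasets

# Record missing values before handling, and work on copies of the raw data.
# (These definitions are assumed; the cell that originally created them is not shown.)
missing_values_train = train_df.isnull().sum()
missing_values_test = test_df.isnull().sum()
train_df_filled = train_df.copy()
test_df_filled = test_df.copy()

# Column groups by dtype; numeric_columns includes the target, which exists only in train
numeric_columns = train_df.select_dtypes(include='number').columns
categorical_columns = train_df.select_dtypes(exclude='number').columns
test_numeric_columns = numeric_columns.drop('term_deposit_subscribed')

# Fill missing values in numeric columns with median
train_df_filled[numeric_columns] = train_df[numeric_columns].apply(lambda x: x.fillna(x.median()), axis=0)
test_df_filled[test_numeric_columns] = test_df[test_numeric_columns].apply(lambda x: x.fillna(x.median()), axis=0)

# Fill missing values in categorical columns with mode
train_df_filled[categorical_columns] = train_df[categorical_columns].apply(lambda x: x.fillna(x.mode()[0]), axis=0)
test_df_filled[categorical_columns] = test_df[categorical_columns].apply(lambda x: x.fillna(x.mode()[0]), axis=0)

# Check for missing values after handling
missing_values_train_filled = train_df_filled.isnull().sum()
missing_values_test_filled = test_df_filled.isnull().sum()

(missing_values_train, missing_values_train_filled, missing_values_test, missing_values_test_filled)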
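
The cell above fills each dataset with its own medians and modes. A common variation, shown here as an alternative rather than as the approach used in this notebook, is to compute the fill values on the training data only and reuse them for the test data, so both sets are treated identically. `fit_fill_values` is a hypothetical helper, and the toy frames hold made-up values.

```python
import numpy as np
import pandas as pd

def fit_fill_values(train: pd.DataFrame, numeric_cols: list, categorical_cols: list) -> dict:
    """Compute per-column fill values (median or mode) from the training frame only."""
    fill = {col: train[col].median() for col in numeric_cols}
    fill.update({col: train[col].mode()[0] for col in categorical_cols})
    return fill

# Toy frames (made-up values) standing in for train_df / test_df
toy_train = pd.DataFrame({
    "balance": [100.0, np.nan, 300.0],
    "marital": ["married", "married", None],
})
toy_test = pd.DataFrame({
    "balance": [np.nan, 50.0],
    "marital": [None, "single"],
})

fill_values = fit_fill_values(toy_train, ["balance"], ["marital"])

# DataFrame.fillna accepts a dict mapping column name -> fill value
toy_train_filled = toy_train.fillna(fill_values)
toy_test_filled = toy_test.fillna(fill_values)
print(fill_values)
```

Which variant is preferable depends on how the test set is meant to be used; reusing training statistics keeps the preprocessing consistent with what a fitted model would see on new data.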
