## Data Preprocessing: Subsetting, Cleaning, and Reformatting `bank_marketing.csv`

1.  **Subset:** Extract the relevant columns for each of the three DataFrames: `client`, `campaign`, and `economics`.
2.  **Clean:** Address data quality issues within each subset, such as inconsistent data types and missing values (the dataset is already relatively clean).
3.  **Reformat:** Ensure the data type of each column matches the specifications provided in the notebook.
4.  **Store:** Save each DataFrame as a separate `.csv` file without the index: `client.csv`, `campaign.csv`, and `economics.csv`.

The resulting files will facilitate more targeted analysis related to:

*   **Client:**  Demographic and personal information about the clients.
*   **Campaign:**  Details and results of the marketing campaign.
*   **Economics:**  Relevant economic indicators at the time of the campaign.

In [1]:
import pandas as pd
import numpy as np

In [3]:
df = pd.read_csv('bank_marketing.csv')

In [5]:
df.head()

,client_id,age,job,marital,education,credit_default,mortgage,month,day,contact_duration,number_contacts,previous_campaign_contacts,previous_outcome,cons_price_idx,euribor_three_months,campaign_outcome
0,0,56,housemaid,married,basic.4y,no,no,may,13,261,1,0,nonexistent,93.994,4.857,no
1,1,57,services,married,high.school,unknown,no,may,19,149,1,0,nonexistent,93.994,4.857,no
2,2,37,services,married,high.school,no,yes,may,23,226,1,0,nonexistent,93.994,4.857,no
3,3,40,admin.,married,basic.6y,no,no,may,27,151,1,0,nonexistent,93.994,4.857,no
4,4,56,services,married,high.school,no,no,may,3,307,1,0,nonexistent,93.994,4.857,no


In [7]:
# job: replace "." with "_" (regex=False treats "." as a literal character)
df['job'] = df['job'].str.replace('.', '_', regex=False)

# education: replace "." with "_", then map "unknown" to NaN
df['education'] = df['education'].str.replace('.', '_', regex=False)
df['education'] = df['education'].replace('unknown', np.nan)

# credit_default: convert to boolean (True if "yes"; "no" and "unknown" become False)
df['credit_default'] = df['credit_default'] == 'yes'

# mortgage: convert to boolean (True if "yes", otherwise False)
df['mortgage'] = df['mortgage'] == 'yes'
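
The `.` replacement in this cell only works as intended when the dot is treated as a literal character rather than as a regular-expression wildcard. As a minimal sketch (the `values` Series is illustrative and simply reuses a few strings seen in the output above), passing `regex=False` makes the literal behavior explicit, while the standard-library `re.sub` call shows what a regex engine does with the same pattern:

```python
import re

import pandas as pd

# Illustrative values only; not read from bank_marketing.csv
values = pd.Series(['admin.', 'basic.4y', 'high.school'])

# Literal replacement: only the "." characters change
literal = values.str.replace('.', '_', regex=False)
print(literal.tolist())

# For contrast, a regex engine reads "." as "any character"
wildcard = re.sub('.', '_', 'admin.')
print(wildcard)
```

Because the default value of `regex` has differed between pandas versions, spelling it out is a cheap way to keep the result stable.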

In [9]:
# previous_outcome: convert to boolean (True if "success", otherwise False)
df['previous_outcome'] = df['previous_outcome'] == 'success'

# campaign_outcome: convert to boolean (True if "yes", otherwise False)
df['campaign_outcome'] = df['campaign_outcome'] == 'yes'
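
Each boolean conversion in this notebook collapses every value other than the target string to `False`, so an `unknown` in `credit_default` becomes indistinguishable from `no`. If that distinction matters, one possible alternative (not what this notebook does) is an explicit mapping into pandas' nullable `boolean` dtype. A sketch on made-up values, where `raw`, `strict`, and `nullable` are illustrative names:

```python
import pandas as pd

# Illustrative values only
raw = pd.Series(['no', 'unknown', 'yes'])

# Approach used in the notebook: anything that is not "yes" becomes False
strict = raw == 'yes'

# Alternative: unmapped values such as "unknown" become missing (<NA>)
nullable = raw.map({'yes': True, 'no': False}).astype('boolean')

print(strict.tolist())
print(nullable.isna().tolist())
```

Which behavior is correct depends on the column specification being targeted; the plain `bool` conversion is the one that matches the outputs shown here.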

In [13]:
# last_contact_date: combine day, month, and a fixed year of 2022 (format "YYYY-MM-DD")

# Temporary 'year' column with the value 2022
df['year'] = 2022

# Build strings such as "2022-may-13" and parse them; %b matches the abbreviated month name.
# Note: .dt.date stores plain Python date objects, so the column has object dtype.
df['last_contact_date'] = pd.to_datetime(
    df['year'].astype(str) + '-' + df['month'] + '-' + df['day'].astype(str),
    format='%Y-%b-%d'
).dt.date

# Drop the temporary 'year' column and the original 'month' and 'day' columns
df = df.drop(['year', 'month', 'day'], axis=1)
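
To make the date construction easier to follow, here is a small sketch on a made-up two-row frame (`toy`, `text`, and `parsed` are illustrative names). It also shows the difference between keeping the parsed column as a datetime dtype and converting it with `.dt.date`, which yields Python `date` objects. Capitalizing the month is a defensive assumption added for this sketch, not something the notebook does:

```python
import pandas as pd

# Illustrative rows only; the column names mirror the notebook's 'month' and 'day'
toy = pd.DataFrame({'month': ['may', 'jun'], 'day': [13, 3]})

# Build strings such as "2022-May-13" and parse them with an explicit format
text = '2022-' + toy['month'].str.capitalize() + '-' + toy['day'].astype(str)
parsed = pd.to_datetime(text, format='%Y-%b-%d')

print(parsed.dtype)                             # a datetime64 dtype
print(parsed.dt.strftime('%Y-%m-%d').tolist())  # ISO-formatted strings
print(type(parsed.dt.date.iloc[0]))             # datetime.date
```

Either form is written to CSV as `YYYY-MM-DD`; the choice mainly affects which operations are available on the column in memory.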

In [15]:
df.head()

,client_id,age,job,marital,education,credit_default,mortgage,contact_duration,number_contacts,previous_campaign_contacts,previous_outcome,cons_price_idx,euribor_three_months,campaign_outcome,last_contact_date
0,0,56,housemaid,married,basic_4y,False,False,261,1,0,False,93.994,4.857,False,2022-05-13
1,1,57,services,married,high_school,False,False,149,1,0,False,93.994,4.857,False,2022-05-19
2,2,37,services,married,high_school,False,True,226,1,0,False,93.994,4.857,False,2022-05-23
3,3,40,admin_,married,basic_6y,False,False,151,1,0,False,93.994,4.857,False,2022-05-27
4,4,56,services,married,high_school,False,False,307,1,0,False,93.994,4.857,False,2022-05-03


In [17]:
# Columns that describe the client
client_col = ['client_id', 'age', 'job', 'marital', 'education', 'credit_default', 'mortgage']

# Create the client DataFrame
client_df = df[client_col]

# Export to CSV without the index
client_df.to_csv('client.csv', index=False)

client_df.head()

,client_id,age,job,marital,education,credit_default,mortgage
0,0,56,housemaid,married,basic_4y,False,False
1,1,57,services,married,high_school,False,False
2,2,37,services,married,high_school,False,True
3,3,40,admin_,married,basic_6y,False,False
4,4,56,services,married,high_school,False,False


In [19]:
# Columns that describe the campaign and its outcome
campaign_col = ['client_id', 'number_contacts', 'contact_duration', 'previous_campaign_contacts',
                'previous_outcome', 'campaign_outcome', 'last_contact_date']

# Create the campaign DataFrame and export it without the index
campaign_df = df[campaign_col]
campaign_df.to_csv('campaign.csv', index=False)
campaign_df.head()

,client_id,number_contacts,contact_duration,previous_campaign_contacts,previous_outcome,campaign_outcome,last_contact_date
0,0,1,261,0,False,False,2022-05-13
1,1,1,149,0,False,False,2022-05-19
2,2,1,226,0,False,False,2022-05-23
3,3,1,151,0,False,False,2022-05-27
4,4,1,307,0,False,False,2022-05-03


In [21]:
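# Columns that hold the economic indicators
econs_col = ['client_id', 'cons_price_idx', 'euribor_three_months']

# Create the economics DataFrame and export it without the index
economics_df = df[econs_col]
economics_df.to_csv('economics.csv', index=False)
economics_df.head()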

,client_id,cons_price_idx,euribor_three_months
0,0,93.994,4.857
1,1,93.994,4.857
2,2,93.994,4.857
3,3,93.994,4.857
4,4,93.994,4.857
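
As a final sanity check, the exported files can be read back to confirm that no index column was written and that the data types survive the round trip. The following is a sketch under stated assumptions: the `toy` frame uses made-up values with a few of the notebook's column names, the `subsets` mapping is hypothetical, and the files go to a temporary directory rather than the working directory.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Toy frame with a few of the notebook's column names; values are made up
toy = pd.DataFrame({
    'client_id': [0, 1],
    'education': ['basic_4y', np.nan],
    'mortgage': [False, True],
    'last_contact_date': pd.to_datetime(['2022-05-13', '2022-05-19']),
    'cons_price_idx': [93.994, 93.994],
})

# Hypothetical mapping of output file name to the columns it should hold
subsets = {
    'client.csv': ['client_id', 'education', 'mortgage'],
    'campaign.csv': ['client_id', 'last_contact_date'],
    'economics.csv': ['client_id', 'cons_price_idx'],
}

# Write each subset without the index, as in the notebook
out_dir = tempfile.mkdtemp()
for name, cols in subsets.items():
    toy[cols].to_csv(os.path.join(out_dir, name), index=False)

# Read two of the files back; dates are strings unless parse_dates is given
client_back = pd.read_csv(os.path.join(out_dir, 'client.csv'))
campaign_back = pd.read_csv(os.path.join(out_dir, 'campaign.csv'),
                            parse_dates=['last_contact_date'])

print(list(client_back.columns))
print(client_back.dtypes)
print(campaign_back.dtypes)
```

With `index=False`, the reloaded files contain only the selected columns. Missing `education` values are written as empty fields and come back as `NaN`.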
