# Santander Product Recommendation: Data Wrangling/Cleaning

For this project, I decided to go with the [Santander Product Recommendation Kaggle competition](https://www.kaggle.com/c/santander-product-recommendation). This choice makes the data wrangling process fairly straightforward: all that was needed to acquire the data was to download the zip file associated with the competition. It included 3 CSV files: a training dataset, a testing dataset and a sample submission. For now, I will only be using the training dataset.

Since the data was pretty much wrangled for me, I will use this notebook mostly for data cleaning.

In [4]:
#Import packages
import pandas as pd
import numpy as np

In [5]:
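#Read in data
#Limit rows to 7 million
#Note: error_bad_lines=False skips malformed lines; newer pandas versions spell this on_bad_lines='skip'
df = pd.read_csv('Data/train_ver2.csv', nrows=7000000, error_bad_lines=False)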


This dataset is very large. If I were to keep it at 7 million rows, speed would suffer significantly. Therefore, I will take a random sample of those 7 million and keep the dataset at 1 million.

In [6]:
#Take a random sample of 1 million
df = df.sample(1000000, random_state=1)

I need to rename the columns, because they are mostly in Spanish. I decided to do this so that I could keep things easy to understand and follow along with.

In [7]:
#Rename columns
df.columns = ['date', 'customer_code', 'employee_index', 'customer_country', 'sex', 'age', 'first_contract_date', 
              'new_customer_index', 'customer_seniority', 'primary_customer_index', 'last_date_primary_customer', 
              'customer_type', 'customer_relation_type', 'residence_index', 'foreign_index', 'spouse_index', 
              'customer_join_channel', 'deceased_index', 'address_type', 'province_code', 'province_name', 
              'activity_index', 'household_gross_income', 'segmentation', 'savings_account', 'gurantees', 
              'current_accounts', 'derivada_account', 'payroll_account', 'junior_account','mas_particular_account',
              'particular_account', 'particular_plus_account', 'shortterm_deposits', 'mediumterm_deposits', 
              'longterm_deposits', 'online_account', 'funds', 'mortgage', 'pensions', 'loans', 'taxes', 'credit_card', 
              'securities', 'home_account', 'payroll', 'pensions_2', 'direct_debit']
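
Assigning a full list to `df.columns` works only if the list is in exactly the same order as the file's columns. As a hedged alternative, here is a minimal sketch that renames by name instead. The Spanish headers in `column_map` are my assumption about the raw file (they are not shown in this notebook), and `raw` is a toy frame, not the real data.

```python
import pandas as pd

# Assumed subset of the original Spanish headers (hypothetical, for illustration only)
column_map = {
    "fecha_dato": "date",
    "ncodpers": "customer_code",
    "ind_empleado": "employee_index",
    "pais_residencia": "customer_country",
    "sexo": "sex",
}

# Toy frame standing in for the raw training data
raw = pd.DataFrame([["2015-08-28", 283524, "N", "ES", "V"]], columns=list(column_map))

# rename() matches on the existing header, so column order does not matter
renamed = raw.rename(columns=column_map)
print(list(renamed.columns))
```

With a mapping like this, a reordered or missing column would not silently shift every name by one position.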

In [8]:
#Observe what we have so far
df.head()

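| | date | customer_code | employee_index | customer_country | sex | age | first_contract_date | new_customer_index | customer_seniority | primary_customer_index | ... | mortgage | pensions | loans | taxes | credit_card | securities | home_account | payroll | pensions_2 | direct_debit |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 5444043 | 2015-08-28 | 283524 | N | ES | V | 45 | 2001-10-17 | 0.0 | 166 | 1.0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1.0 | 1.0 | 1 |
| 3989833 | 2015-07-28 | 807725 | N | ES | H | 30 | 2008-10-22 | 0.0 | 81 | 1.0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0 |
| 900513 | 2015-02-28 | 383639 | N | ES | V | 46 | 2002-09-30 | 0.0 | 154 | 1.0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0 |
| 6966928 | 2015-10-28 | 137185 | N | ES | V | 69 | 1999-06-30 | 0.0 | 196 | 1.0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0.0 | 0.0 | 0 |
| 2171286 | 2015-04-28 | 216607 | N | ES | V | 44 | 2001-01-19 | 0.0 | 174 | 1.0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0 |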


In [9]:
#Check number of rows
df.customer_code.count()

1000000

By now, we have successfully changed the column names and limited our dataset to 1 million. Here is where the real data cleaning will begin.

### Data Types

I will begin by fixing the data types for various columns. Right now, many columns are marked as an 'object' type even though they hold numbers or dates. Columns that contain numeric values can be converted with pd.to_numeric(). I'll start with some columns that clearly need to be marked as such.

I also want to change the 'date' column to a datetime object. After that, I will create a new column 'month' that can help us find additional insights later on in our analysis.

In [10]:
#Fix data types
df.age = pd.to_numeric(df.age, errors='coerce')
df.date = pd.to_datetime(df.date, format="%Y-%m-%d", errors='coerce')
df.household_gross_income = pd.to_numeric(df.household_gross_income, errors='coerce')
df.customer_seniority = pd.to_numeric(df.customer_seniority, errors="coerce")
df.first_contract_date = pd.to_datetime(df.first_contract_date, format="%Y-%m-%d", errors='coerce')

#Create month column for possible insights
df['month'] = pd.DatetimeIndex(df.date).month
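
To make the effect of `errors='coerce'` concrete, here is a small sketch on made-up values (the toy frame `raw` is not from the competition data). Anything that cannot be parsed becomes NaN or NaT instead of raising an error, which is why the null counts matter in the next section.

```python
import pandas as pd

# Toy values: "NA" is not a number and "bad" is not a date
raw = pd.DataFrame({
    "age": ["45", "NA", "30"],
    "date": ["2015-08-28", "2015-07-28", "bad"],
})

# Unparseable entries are turned into NaN / NaT rather than raising
age = pd.to_numeric(raw["age"], errors="coerce")
date = pd.to_datetime(raw["date"], format="%Y-%m-%d", errors="coerce")

# The .dt accessor is an equivalent way to extract the month
month = date.dt.month

print(age.isnull().sum(), date.isnull().sum())
```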

In [11]:
#Confirm data types were fixed
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1000000 entries, 5444043 to 1016900
Data columns (total 49 columns):
date                          1000000 non-null datetime64[ns]
customer_code                 1000000 non-null int64
employee_index                996089 non-null object
customer_country              996089 non-null object
sex                           996085 non-null object
age                           996089 non-null float64
first_contract_date           996089 non-null datetime64[ns]
new_customer_index            996089 non-null float64
customer_seniority            996089 non-null float64
primary_customer_index        996089 non-null float64
last_date_primary_customer    1522 non-null object
customer_type                 985073 non-null object
customer_relation_type        985073 non-null object
residence_index               996089 non-null object
foreign_index                 996089 non-null object
spouse_index                  144 non-null object
customer_join_cha

### Null Values

We need to get rid of all NaN values in the dataset. df.isnull().sum() will help us do that by giving us a frame of reference. I will refer back to that output from time to time to make sure all of the NaNs are taken care of.

In [12]:
#See which columns have null values (And how many)
df.isnull().sum()

date                               0
customer_code                      0
employee_index                  3911
customer_country                3911
sex                             3915
age                             3911
first_contract_date             3911
new_customer_index              3911
customer_seniority              3911
primary_customer_index          3911
last_date_primary_customer    998478
customer_type                  14927
customer_relation_type         14927
residence_index                 3911
foreign_index                   3911
spouse_index                  999856
customer_join_channel          16985
deceased_index                  3911
address_type                    3911
province_code                   9061
province_name                   9061
activity_index                  3911
household_gross_income        179144
segmentation                   17143
savings_account                    0
gurantees                          0
current_accounts                   0
d

In [13]:
#Drop null rows in date and customer_code columns
df = df.dropna(axis=0, subset=['date', 'customer_code'])

#Drop columns with more than 50% null values
df = df.drop(['last_date_primary_customer', 'spouse_index'], axis=1)
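
The two columns above were picked by reading the null counts. A minimal sketch of how the same 50% rule could be applied programmatically, using a toy frame and a threshold variable that are both my own illustration:

```python
import numpy as np
import pandas as pd

# Toy frame: spouse_index is 75% null, sex is 25% null
toy = pd.DataFrame({
    "customer_code": [1, 2, 3, 4],
    "spouse_index": [np.nan, np.nan, np.nan, "N"],
    "sex": ["V", "H", np.nan, "V"],
})

# isnull().mean() gives the fraction of missing values per column
null_fraction = toy.isnull().mean()

# Columns whose missing fraction exceeds the threshold
threshold = 0.5
mostly_null = null_fraction[null_fraction > threshold].index.tolist()

trimmed = toy.drop(columns=mostly_null)
print(mostly_null)
```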

In [14]:
#Change null values to median date
dates=df.loc[:,"first_contract_date"].sort_values().reset_index()
median_date = int(np.median(dates.index.values))
df.loc[df.first_contract_date.isnull(),"first_contract_date"] = dates.loc[median_date,"first_contract_date"]
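
The cell above takes the middle position of the sorted column, with the null rows still counted in the length. As a hedged variation, this sketch picks the middle element of the non-null dates only; the dates are toy values and the variable names are mine.

```python
import pandas as pd

# Toy contract dates with one missing value
dates = pd.Series(pd.to_datetime(["2001-10-17", None, "2008-10-22", "2002-09-30"]))

# Sort only the known dates and take the middle one
known = dates.dropna().sort_values().reset_index(drop=True)
median_date = known.iloc[len(known) // 2]

# Fill the gaps with that date
filled = dates.fillna(median_date)
print(median_date)
```

On a column with few nulls the two approaches land on nearly the same date, so this is a refinement rather than a correction.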

In [15]:
#A lot of null values here
df.new_customer_index.isnull().sum()

3911

In [16]:
#Check to see how 'new' these null customers are
months_active = df.loc[df["new_customer_index"].isnull(),:].groupby("customer_code", sort=False).size()
months_active.max()

4

In [17]:
#Mark as new customers
df.loc[df.new_customer_index.isnull(), 'new_customer_index'] = 1

In [18]:
#Similar number of null values as new_customer_index
#Most likely same people
df.customer_seniority.isnull().sum()

3911

In [19]:
#We will give them the minimum seniority since we know that
#they are all new customers
df.loc[df.customer_seniority.isnull(),"customer_seniority"] = df.customer_seniority.min()
df.loc[df.customer_seniority <0, "customer_seniority"] = 0

In [20]:
#Drop unneeded columns
#We do not need address type
#We do not need province_code because we have the name of each province in province_name
df.drop(["address_type","province_code"],axis=1,inplace=True)

There seems to be a pattern here. Many of the demographic columns share the same null count (3,911 in the summary above), which tells me that the nulls sit in almost the same rows across those columns and that these rows are likely filled with bad data. For the purposes of this project, I will go ahead and drop every row that has a null value in any of the demographic columns listed below.
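
Before dropping, the idea that the nulls line up in the same rows can be checked directly. This is a minimal sketch on a toy frame (the values and the short column list are made up): it counts how many demographic fields are missing per row.

```python
import numpy as np
import pandas as pd

# Toy frame: rows 1 and 3 are missing every demographic field
toy = pd.DataFrame({
    "customer_code": [1, 2, 3, 4],
    "sex": ["V", np.nan, "H", np.nan],
    "age": [45, np.nan, 30, np.nan],
    "segmentation": ["01 - TOP", np.nan, np.nan, np.nan],
})
demographic_cols = ["sex", "age", "segmentation"]

# Number of missing demographic fields in each row
nulls_per_row = toy[demographic_cols].isnull().sum(axis=1)

# Rows missing everything vs. rows missing at least one field
all_null = int((nulls_per_row == len(demographic_cols)).sum())
any_null = int((nulls_per_row > 0).sum())
print(all_null, any_null)
```

If `all_null` is close to `any_null` on the real data, the "same rows" reading holds and dropping them loses little.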

In [21]:
df = df.dropna(axis=0, subset=['employee_index', 'customer_country', 'sex', 'age', 'primary_customer_index', 
                               'customer_type', 'customer_relation_type', 'residence_index', 'foreign_index', 
                               'customer_join_channel', 'deceased_index', 'activity_index', 'segmentation'])

In [22]:
#Our progress so far
df.isnull().sum()

date                            0
customer_code                   0
employee_index                  0
customer_country                0
sex                             0
age                             0
first_contract_date             0
new_customer_index              0
customer_seniority              0
primary_customer_index          0
customer_type                   0
customer_relation_type          0
residence_index                 0
foreign_index                   0
customer_join_channel           0
deceased_index                  0
province_name                5136
activity_index                  0
household_gross_income     168139
segmentation                    0
savings_account                 0
gurantees                       0
current_accounts                0
derivada_account                0
payroll_account                 0
junior_account                  0
mas_particular_account          0
particular_account              0
particular_plus_account         0
shortterm_depo

### Check Each Column for Strange Values

Now that we have taken care of most of the null values for the demographic data, I decided to take a look at the value counts for each individual column to see if there were any strange values included. I will now work on getting those strange values out of the dataset.

In [23]:
#See what values we have for primary_customer_index
pd.Series([i for i in df.primary_customer_index]).value_counts()

1.0     981573
99.0      1225
dtype: int64

The 'primary_customer_index' column is a prime example of a column that includes bad data. There should only be two values here: 1 or 99.

In [24]:
#Convert to numeric
df.primary_customer_index = pd.to_numeric(df.primary_customer_index, errors='coerce')

In [25]:
pd.Series([i for i in df.primary_customer_index]).value_counts()

1.0     981573
99.0      1225
dtype: int64

Converting to numeric helped, but we still need to get rid of 0.

In [26]:
#Change 0 to most popular (1.0)
df.loc[df.primary_customer_index <= 0, 'primary_customer_index'] = 1

In [27]:
#Confirm it worked
pd.Series([i for i in df.primary_customer_index]).value_counts()

1.0     981573
99.0      1225
dtype: int64

Next we will take a look at the values for 'province_name'. Clearly there is some bad data in here that needs to be taken care of as well.

In [28]:
df.province_name.value_counts()

MADRID                    320201
BARCELONA                  89881
VALENCIA                   47428
SEVILLA                    44680
CORUÑA, A                  31688
MURCIA                     28674
MALAGA                     27513
ZARAGOZA                   24703
ALICANTE                   21913
CADIZ                      21794
PONTEVEDRA                 20283
ASTURIAS                   18613
VALLADOLID                 17318
PALMAS, LAS                16847
BADAJOZ                    14181
BIZKAIA                    13453
TOLEDO                     13286
GRANADA                    12938
SALAMANCA                  12042
CANTABRIA                  11030
CORDOBA                    10607
CACERES                     9715
HUELVA                      9383
CIUDAD REAL                 8892
BALEARS, ILLES              8398
ALBACETE                    8114
CASTELLON                   7664
BURGOS                      6924
GIRONA                      6430
NAVARRA                     6404
TARRAGONA 

In [29]:
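#Change type to string
#Note: astype(str) turns missing values into the literal string 'nan',
#so the dropna below has nothing left to drop; 'nan' is removed in a later cell
df.province_name = df.province_name.astype(str)

#Drop null values
df = df.dropna(axis=0, subset=['province_name'])

#Create a boolean mask that is True for names made up only of alphabetic characters
#Note: this also excludes multi-word names such as 'CIUDAD REAL' and 'CORUÑA, A'
subset = df.province_name.str.isalpha()

#Keep only the rows selected by the mask
df = df[subset]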

#Check to make sure everything worked
df.province_name.value_counts()

MADRID         320201
BARCELONA       89881
VALENCIA        47428
SEVILLA         44680
MURCIA          28674
MALAGA          27513
ZARAGOZA        24703
ALICANTE        21913
CADIZ           21794
PONTEVEDRA      20283
ASTURIAS        18613
VALLADOLID      17318
BADAJOZ         14181
BIZKAIA         13453
TOLEDO          13286
GRANADA         12938
SALAMANCA       12042
CANTABRIA       11030
CORDOBA         10607
CACERES          9715
HUELVA           9383
ALBACETE         8114
CASTELLON        7664
BURGOS           6924
GIRONA           6430
NAVARRA          6404
TARRAGONA        6388
LUGO             6353
OURENSE          6218
LEON             5825
LERIDA           5676
nan              5136
GIPUZKOA         5133
JAEN             4617
GUADALAJARA      4336
ALMERIA          4311
CUENCA           4219
ZAMORA           3730
PALENCIA         3472
SEGOVIA          3042
AVILA            2829
HUESCA           2788
ALAVA            2780
TERUEL           1612
SORIA            1206
MELILLA   

There seem to be two values that don't belong here: 'nan' and 'VRCIA'. I will drop those from the dataframe.

In [30]:
#Drop bad values
df = df[df.province_name != 'nan']
df = df[df.province_name != 'VRCIA']
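
One side effect worth noting: `str.isalpha()` is False for any name containing a space or comma, so provinces such as 'CIUDAD REAL' and 'CORUÑA, A' appear in the first value counts but not in the later ones. Below is a hedged sketch of a more permissive filter. The regular expression and the list of known bad values are my own assumptions, and '0.0' is a hypothetical bad entry added for illustration.

```python
import pandas as pd

# Toy values: three real-looking names plus three bad entries
provinces = pd.Series(["MADRID", "CIUDAD REAL", "CORUÑA, A", "nan", "VRCIA", "0.0"])

# Strict filter used in the notebook: rejects spaces and commas too
strict = provinces[provinces.str.isalpha()]

# Permissive filter: upper-case words separated by a space or ", "
looks_like_name = provinces.str.match(r"^[A-ZÑ]+(?:,? [A-ZÑ]+)*$")
known_bad = ["nan", "VRCIA"]
permissive = provinces[looks_like_name & ~provinces.isin(known_bad)]

print(strict.tolist())
print(permissive.tolist())
```

I kept the stricter filter for the outputs in this notebook, so the counts shown here reflect single-word province names only.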

In [31]:
#Check that it worked
df.province_name.value_counts()

MADRID         320201
BARCELONA       89881
VALENCIA        47428
SEVILLA         44680
MURCIA          28674
MALAGA          27513
ZARAGOZA        24703
ALICANTE        21913
CADIZ           21794
PONTEVEDRA      20283
ASTURIAS        18613
VALLADOLID      17318
BADAJOZ         14181
BIZKAIA         13453
TOLEDO          13286
GRANADA         12938
SALAMANCA       12042
CANTABRIA       11030
CORDOBA         10607
CACERES          9715
HUELVA           9383
ALBACETE         8114
CASTELLON        7664
BURGOS           6924
GIRONA           6430
NAVARRA          6404
TARRAGONA        6388
LUGO             6353
OURENSE          6218
LEON             5825
LERIDA           5676
GIPUZKOA         5133
JAEN             4617
GUADALAJARA      4336
ALMERIA          4311
CUENCA           4219
ZAMORA           3730
PALENCIA         3472
SEGOVIA          3042
AVILA            2829
HUESCA           2788
ALAVA            2780
TERUEL           1612
SORIA            1206
MELILLA           687
CEUTA     

In [32]:
#Switching gears back to null values, let's get another progress report
df.isnull().sum()

date                            0
customer_code                   0
employee_index                  0
customer_country                0
sex                             0
age                             0
first_contract_date             0
new_customer_index              0
customer_seniority              0
primary_customer_index          0
customer_type                   0
customer_relation_type          0
residence_index                 0
foreign_index                   0
customer_join_channel           0
deceased_index                  0
province_name                   0
activity_index                  0
household_gross_income     145214
segmentation                    0
savings_account                 0
gurantees                       0
current_accounts                0
derivada_account                0
payroll_account                 0
junior_account                  0
mas_particular_account          0
particular_account              0
particular_plus_account         0
shortterm_depo

Next on our list is fixing 'household_gross_income'. It contains a lot of null values, but instead of just changing them to the median of the entire column, I want to change those null values to the median for each 'province_name'. 

In [33]:
#Group 'household_gross_income' by 'province_name' and get the median income
incomes = df.loc[df.household_gross_income.notnull(),:].groupby("province_name").agg({"household_gross_income":{"MedianIncome":np.median}})

#Sort
incomes.sort_values(by=("household_gross_income","MedianIncome"),inplace=True)

#Reset index
incomes.reset_index(inplace=True)

#Change type
incomes.province_name = incomes.province_name.astype("category", categories=[i for i in df.province_name.unique()],ordered=False)

#Observe results
incomes.head()



| | province_name | household_gross_income (MedianIncome) |
|---|---|---|
| 0 | BADAJOZ | 62142.54 |
| 1 | LUGO | 63755.73 |
| 2 | LERIDA | 64481.91 |
| 3 | CASTELLON | 66630.3 |
| 4 | ALICANTE | 67420.11 |
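
The nested-dictionary form of `agg()` used in this cell comes from an older pandas release and, as far as I know, is no longer accepted by recent versions. A minimal sketch of an equivalent median-by-province table that should work on current pandas, shown on toy incomes rather than the real data:

```python
import pandas as pd

# Toy incomes: one missing value in LUGO
toy = pd.DataFrame({
    "province_name": ["MADRID", "MADRID", "LUGO", "LUGO", "LUGO"],
    "household_gross_income": [100.0, 200.0, 50.0, None, 70.0],
})

# median() skips missing values by default, so no notnull() filter is needed
incomes = (
    toy.groupby("province_name")["household_gross_income"]
       .median()
       .rename("MedianIncome")
       .sort_values()
       .reset_index()
)
print(incomes)
```

This version yields flat column names (`province_name`, `MedianIncome`) instead of the two-level header shown above.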


In [34]:
#Group 'household_gross_income' by 'province_name' and get the median
grouped = df.groupby("province_name").agg({"household_gross_income":lambda x: x.median(skipna=True)}).reset_index()

#Merge our main dataframe and the grouped dataframe
new_incomes = pd.merge(df,grouped,how="inner",on="province_name").loc[:, ["province_name","household_gross_income_y"]]

#Rename columns
new_incomes = new_incomes.rename(columns={"household_gross_income_y":"household_gross_income"}).sort_values("household_gross_income").sort_values("province_name")

#Sort
df.sort_values("province_name",inplace=True)

#Reset index
df = df.reset_index()
new_incomes = new_incomes.reset_index()

In [35]:
#Finally, change null values to reflect median income by 'province_name'
df.loc[df.household_gross_income.isnull(),"household_gross_income"] = new_incomes.loc[df.household_gross_income.isnull(),"household_gross_income"].reset_index()
df.loc[df.household_gross_income.isnull(),"household_gross_income"] = df.loc[df.household_gross_income.notnull(),"household_gross_income"].median()
df.sort_values(by="date",inplace=True)
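
The merge, sort and reset steps above depend on both frames lining up row by row. As a hedged alternative, the same idea (province median first, overall median as a fallback) can be sketched with `groupby().transform()`, which keeps the original row index. The toy values are made up, and this is not the code that produced the outputs in this notebook.

```python
import numpy as np
import pandas as pd

# Toy data: CEUTA has no known income at all
toy = pd.DataFrame({
    "province_name": ["MADRID", "MADRID", "MADRID", "LUGO", "LUGO", "CEUTA"],
    "household_gross_income": [100.0, np.nan, 200.0, 60.0, np.nan, np.nan],
})

# Median income of each row's province, aligned to the original index
province_median = toy.groupby("province_name")["household_gross_income"].transform("median")

# First fill with the province median
toy["household_gross_income"] = toy["household_gross_income"].fillna(province_median)

# Then fall back to the overall median for provinces with no data
overall_median = toy["household_gross_income"].median()
toy["household_gross_income"] = toy["household_gross_income"].fillna(overall_median)

print(toy)
```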

In [36]:
#Getting closer
df.isnull().sum()

index                       0
date                        0
customer_code               0
employee_index              0
customer_country            0
sex                         0
age                         0
first_contract_date         0
new_customer_index          0
customer_seniority          0
primary_customer_index      0
customer_type               0
customer_relation_type      0
residence_index             0
foreign_index               0
customer_join_channel       0
deceased_index              0
province_name               0
activity_index              0
household_gross_income      0
segmentation                0
savings_account             0
gurantees                   0
current_accounts            0
derivada_account            0
payroll_account             0
junior_account              0
mas_particular_account      0
particular_account          0
particular_plus_account     0
shortterm_deposits          0
mediumterm_deposits         0
longterm_deposits           0
online_acc

Much better! At this point, the rest of the null values are in the 'product' columns. Since there are very few of them (in the grand scheme of our 900k+ row dataset), I will go ahead and change those null values to 0, which indicates that the customer has not bought the service.

In [37]:
#Change remaining null values to 0
product_cols = ["payroll_account", "junior_account", "mas_particular_account", "particular_account",
                "particular_plus_account", "shortterm_deposits", "mediumterm_deposits", "longterm_deposits",
                "online_account", "funds", "mortgage", "pensions", "loans", "taxes", "credit_card",
                "securities", "home_account", "payroll", "pensions_2", "direct_debit"]
df[product_cols] = df[product_cols].fillna(0)

In [38]:
df.isnull().sum()

index                      0
date                       0
customer_code              0
employee_index             0
customer_country           0
sex                        0
age                        0
first_contract_date        0
new_customer_index         0
customer_seniority         0
primary_customer_index     0
customer_type              0
customer_relation_type     0
residence_index            0
foreign_index              0
customer_join_channel      0
deceased_index             0
province_name              0
activity_index             0
household_gross_income     0
segmentation               0
savings_account            0
gurantees                  0
current_accounts           0
derivada_account           0
payroll_account            0
junior_account             0
mas_particular_account     0
particular_account         0
particular_plus_account    0
shortterm_deposits         0
mediumterm_deposits        0
longterm_deposits          0
online_account             0
funds         

Now that we have finally gotten rid of all null values, we need to make sure that the unique values for each column are correct and that no bad data is present. I went ahead and checked each column individually. Below are the ones that include data that needs to be fixed.

In [39]:
#Check values for 'new_customer_index'
df.new_customer_index.value_counts()

0.0    876398
1.0     24530
Name: new_customer_index, dtype: int64

In [40]:
#Change values to numeric
df.new_customer_index = pd.to_numeric(df.new_customer_index, errors='coerce')

In [41]:
#Check that it worked
df.new_customer_index.value_counts()

0.0    876398
1.0     24530
Name: new_customer_index, dtype: int64

In [42]:
#Change 'customer_type' to numeric and make sure that it worked
df.customer_type = pd.to_numeric(df.customer_type, errors='coerce')
df.customer_type.value_counts()

1.0    900923
3.0         5
Name: customer_type, dtype: int64

In [43]:
#Do the same thing for 'activity_index'
df.activity_index = pd.to_numeric(df.activity_index, errors='coerce')
df.activity_index.value_counts()

0.0    456322
1.0    444606
Name: activity_index, dtype: int64

In [44]:
#Check unique values for 'segmentation'
df.segmentation.value_counts()

02 - PARTICULARES     538804
03 - UNIVERSITARIO    319592
01 - TOP               42532
Name: segmentation, dtype: int64

In [45]:
#Make bad values null, check to make sure it worked
bad_segments = ['0', '03 - UNIV0', '02 - PART- UNIVERSITARIO', '03 - UNIVERSIT0',
                '03 - UNIVERSITARTICULARES', '02 - PARTI', '02 - PARTICULAIO']
df.loc[df.segmentation.isin(bad_segments), 'segmentation'] = np.nan
df.segmentation.value_counts()

02 - PARTICULARES     538804
03 - UNIVERSITARIO    319592
01 - TOP               42532
Name: segmentation, dtype: int64

In [46]:
#Drop null values 
df = df.dropna(axis=0, subset=['segmentation'])
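
Listing every malformed value only catches the ones I happened to see. A small sketch of the opposite approach, keeping only the three segments that appear in the value counts and nulling everything else (the toy series is made up):

```python
import pandas as pd

# The three valid segments seen in the value counts
valid_segments = ["01 - TOP", "02 - PARTICULARES", "03 - UNIVERSITARIO"]

# Toy column with two malformed entries
segmentation = pd.Series(["02 - PARTICULARES", "03 - UNIV0", "01 - TOP", "0"])

# where() keeps values that pass the test and replaces the rest with NaN
cleaned = segmentation.where(segmentation.isin(valid_segments))

# Rows with an invalid segment can then be dropped as before
kept = cleaned.dropna()
print(kept.tolist())
```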

In [47]:
#'S' should not be included in 'employee_index'
df.employee_index.value_counts()

N    900248
B       276
A       205
F       198
S         1
Name: employee_index, dtype: int64

In [48]:
#Change 'S' to most popular ('N') for 'employee_index' and make sure it worked
df.loc[df.employee_index == 'S', 'employee_index'] = 'N'
df.employee_index.value_counts()

N    900249
B       276
A       205
F       198
Name: employee_index, dtype: int64

In [49]:
#Convert 'particular_account' to numeric and confirm that it worked
df.particular_account = pd.to_numeric(df.particular_account, errors='coerce')
df.particular_account.value_counts()

0    763098
1    137830
Name: particular_account, dtype: int64

In [50]:
#Do the same thing for 'longterm_deposits'
df.longterm_deposits = pd.to_numeric(df.longterm_deposits, errors='coerce')
df.longterm_deposits.value_counts()

#Drop null values
df = df.dropna(axis=0, subset=['longterm_deposits'])

In [51]:
#And do the same thing for 'online_account'
df.online_account = pd.to_numeric(df.online_account, errors='coerce')
df.online_account.value_counts()

0    817134
1     83794
Name: online_account, dtype: int64

We are finally finished cleaning the data! Now there's just one last step. In order to properly identify customer purchasing trends within the data, we need to create a label for each product and unique month that indicates whether a customer added, dropped or maintained a product in that specific billing cycle. We will do this by assigning a numeric ID to each unique time stamp and then matching each entry with the one from the previous month. The difference between the two indicator values gives us the desired label: 1 means the product was added, -1 means it was dropped, and 0 means it was maintained.

In [52]:
#Before we begin, we need to make the feature columns (products the bank offers) integer values
feature_cols = df.iloc[:, 21:-1].columns.values
for col in feature_cols:
    df[col] = df[col].astype(int)

In [53]:
#Create variable for unique months
unique_months = pd.DataFrame(pd.Series(df.date.unique()).sort_values()).reset_index(drop=True)

# start with month 1, not 0 to match what we already have
unique_months["month_id"] = pd.Series(range(1,1+unique_months.size)) 

unique_months["month_next_id"] = 1 + unique_months["month_id"]

unique_months.rename(columns={0:"date"},inplace=True)

#Merge df with unique_months
df = pd.merge(df,unique_months,on="date")

In [54]:
#Define function for labelling each product as 'Added', 'Dropped' or 'Maintained'
def status_change(x):
    diffs = x.diff().fillna(0)
    label = ["Added" if i==1 \
         else "Dropped" if i==-1 \
         else "Maintained" for i in diffs]
    return label
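
To show what the function returns, here is a small sketch that applies the same logic to one customer's ownership history for a single product. The series `owned` is a made-up example, with one entry per month.

```python
import pandas as pd

def status_change(x):
    # Same logic as above: month-over-month difference, first month treated as no change
    diffs = x.diff().fillna(0)
    return ["Added" if i == 1
            else "Dropped" if i == -1
            else "Maintained" for i in diffs]

# Toy history: not owned, bought, kept, then cancelled
owned = pd.Series([0, 1, 1, 0])
labels = status_change(owned)
print(labels)
```

Because the first difference is filled with 0, a product that a customer already holds in their first recorded month is labelled 'Maintained' rather than 'Added'.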

In [55]:
#Apply the function
df.loc[:, feature_cols] = df.loc[:, [i for i in feature_cols]+["customer_code"]].groupby("customer_code").transform(status_change)

In [56]:
#Since we are only interested in seeing instances of customers adding or dropping products,
#we will get rid of any instance in which a customer 'Maintained' a product
df = pd.melt(df, id_vars=[col for col in df.columns if col not in feature_cols],
            value_vars=[col for col in feature_cols])
df = df.loc[df.value!="Maintained",:]

#What does our dataframe look like now?
df.shape

(55901, 26)
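
The reshaping step can be easier to follow on a tiny example. This sketch melts a toy wide frame (two customers, two products, values made up) into one row per customer and product, then keeps only the changes.

```python
import pandas as pd

# Toy wide frame: one column per product, already labelled
wide = pd.DataFrame({
    "customer_code": [1, 2],
    "savings_account": ["Added", "Maintained"],
    "mortgage": ["Maintained", "Dropped"],
})
feature_cols = ["savings_account", "mortgage"]

# melt() stacks the product columns into 'variable' (product) and 'value' (label)
long_df = pd.melt(wide, id_vars=["customer_code"], value_vars=feature_cols)

# Keep only the rows where something changed
changes = long_df.loc[long_df.value != "Maintained", :]
print(changes)
```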

In [57]:
df.head()

| | index | date | customer_code | employee_index | customer_country | sex | age | first_contract_date | new_customer_index | customer_seniority | ... | deceased_index | province_name | activity_index | household_gross_income | segmentation | month | month_id | month_next_id | variable | value |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 699700 | 4827636 | 2015-08-28 | 423998 | N | ES | V | 39.0 | 2003-06-19 | 0.0 | 146.0 | ... | N | MADRID | 1.0 | 74259.78 | 01 - TOP | 8 | 8 | 9 | savings_account | Dropped |
| 1563725 | 5347832 | 2015-08-28 | 89574 | N | ES | V | 38.0 | 1998-03-16 | 0.0 | 209.0 | ... | N | MADRID | 1.0 | 233735.55 | 02 - PARTICULARES | 8 | 8 | 9 | gurantees | Dropped |
| 1883609 | 1103787 | 2015-02-28 | 984406 | N | ES | V | 25.0 | 2011-11-23 | 0.0 | 44.0 | ... | N | MADRID | 1.0 | 45329.46 | 03 - UNIVERSITARIO | 2 | 2 | 3 | current_accounts | Added |
| 1884087 | 879806 | 2015-02-28 | 343743 | N | ES | H | 59.0 | 2002-03-18 | 0.0 | 160.0 | ... | N | MADRID | 1.0 | 140839.83 | 01 - TOP | 2 | 2 | 3 | current_accounts | Added |
| 1884480 | 1030418 | 2015-02-28 | 1322454 | N | ES | H | 22.0 | 2014-10-03 | 0.0 | 10.0 | ... | N | MADRID | 1.0 | 74630.19 | 03 - UNIVERSITARIO | 2 | 2 | 3 | current_accounts | Dropped |


In [58]:
#Save dataframe to csv
df.to_csv("Data/train_ver2_CLEAN", index=False)

We are finally finished with this portion of the project! Now that we have the data clean and tidy, we can move on to some exploratory data analysis, which is what I will primarily focus on in the next notebook.