# First Look at the Data

## Table of Contents
* [Import Libraries](#chapter1)
    * [Import magic autoreload](#section_1_1)
    * [Import the libraries](#section_1_2)
    * [Import custom functions.](#section_1_3)
* [Read the Data](#chapter2)
* [Inspect the Data](#chapter3)
* [Create new variables](#chapter4)
* [Save the Data](#chapter5)


## Import Libraries: <a class="anchor" id="chapter1"></a>

Load the autoreload magic extension so that any changes in external Python modules are picked up automatically. <a class="anchor" id="section_1_1"></a>

In [1]:
# autoreload mode 2: reload all modules before executing each cell
%load_ext autoreload
%autoreload 2

Import the libraries we will use in this notebook. <a class="anchor" id="section_1_2"></a>

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os

Set the current working directory to the project folder.

In [3]:
os.chdir("C:/Users/migue/OneDrive - NOVAIMS/Data Science/Coding Courses/Machine Learning II/Project")
# wd stands for working directory
wd = os.getcwd()

Import our custom functions. <a class="anchor" id="section_1_3"></a>

In [4]:
# import funcs.py from the functions folder of the Project folder
from functions.funcs import *

## Read the Data: <a class="anchor" id="chapter2"></a>

Load every .csv and .xlsx file in the `prof_data` folder into a dictionary of dataframes.

In [5]:
dfs = load_dfs(wd + "/prof_data/")

Created dataframe Basket for Customer Basket Dataset.csv
Created dataframe Info for Customer Info Dataset.csv
Created dataframe Mapping for Product Mapping Excel File.xlsx
File Project Description and Info.pdf is not a .csv or .xlsx file. Skipping it.
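
`load_dfs` is one of our custom helpers and its source is not shown in this notebook. Judging only by the printed output, it could look roughly like the sketch below. The name `load_dfs_sketch` and the rule for naming each dataframe (second word of the file name) are assumptions made for illustration, not the actual implementation.

```python
from pathlib import Path

import pandas as pd


def load_dfs_sketch(folder):
    """Read every .csv/.xlsx file in `folder` into a dict of dataframes.

    Hypothetical stand-in for load_dfs. The dict key is ASSUMED to be the
    second word of the file name ("Customer Basket Dataset.csv" -> "Basket").
    """
    readers = {".csv": pd.read_csv, ".xlsx": pd.read_excel}
    dfs = {}
    for path in sorted(Path(folder).iterdir()):
        reader = readers.get(path.suffix.lower())
        if reader is None:
            print(f"File {path.name} is not a .csv or .xlsx file. Skipping it.")
            continue
        words = path.stem.split()
        key = words[1] if len(words) > 1 else path.stem
        dfs[key] = reader(path)
        print(f"Created dataframe {key} for {path.name}")
    return dfs
```

Reading .xlsx files with `pd.read_excel` requires an Excel engine such as openpyxl to be installed.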


Expose each dataframe in the dictionary as a global variable. <a class="anchor" id="section_2_1"></a>

In [6]:
# Create a global variable for each dataframe in the dfs dict
for key, df in dfs.items():
    globals()[key] = df
    print(f"Created global variable {key} with values from dictionary dfs key {key}")

Created global variable Basket with values from dictionary dfs key Basket
Created global variable Info with values from dictionary dfs key Info
Created global variable Mapping with values from dictionary dfs key Mapping


## Inspect the Data: <a class="anchor" id="chapter3"></a>

In [7]:
print_cols(Basket, "Basket")
print_cols(Info, "Info")
print_cols(Mapping, "Mapping")

Columns in Basket are: 
	-customer_id, invoice_id, list_of_goods

Columns in Info are: 
	-customer_id, customer_name, customer_gender, customer_birthdate, kids_home, 
	-teens_home, number_complaints, distinct_stores_visited, lifetime_spend_groceries, lifetime_spend_electronics, 
	-typical_hour, lifetime_spend_vegetables, lifetime_spend_nonalcohol_drinks, lifetime_spend_alcohol_drinks, lifetime_spend_meat, 
	-lifetime_spend_fish, lifetime_spend_hygiene, lifetime_spend_videogames, lifetime_total_distinct_products, percentage_of_products_bought_promotion, 
	-year_first_transaction, loyalty_card_number, latitude, longitude

Columns in Mapping are: 
	-product_name, category



In [8]:
print_na_cols(Basket, "Basket")
print_na_cols(Info, "Info")
print_na_cols(Mapping, "Mapping")

Basket has no missing values.

In Info, the following columns have missing values:
	-Column loyalty_card_number has 24175 missing values. This equals 80.58% of its values.

Mapping has no missing values.



In [9]:
# check for duplicates
print(f"Number of duplicates in {blue}Basket{end}: {red}{Basket.duplicated().sum()}{end}")
print(f"Number of duplicates in {blue}Info{end}: {red}{Info.duplicated().sum()}{end}")
print(f"Number of duplicates in {blue}Mapping{end}: {red}{Mapping.duplicated().sum()}{end}")

Number of duplicates in Basket: 0
Number of duplicates in Info: 0
Number of duplicates in Mapping: 1


In [10]:
# check for duplicate customer_id and duplicate coordinates in Info
print(f"Number of duplicate customer_id in {blue}Info{end}: {red}{Info.duplicated(subset='customer_id').sum()}{end}")
print(f"Number of duplicate coordinates in {blue}Info{end}: {red}{Info.duplicated(subset=['latitude', 'longitude']).sum()}{end}")


Number of duplicate customer_id in Info: 0
Number of duplicate coordinates in Info: 0


In [11]:
# print the duplicate rows in mapping with their index
print(Mapping[Mapping.duplicated(keep=False)])

    product_name    category
128    asparagus  vegetables
135    asparagus  vegetables
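
The duplicated asparagus row carries no extra information. If we decide to remove it, `drop_duplicates` is enough. The snippet below is a minimal sketch on toy data that mimics the situation; the names `mapping` and `mapping_dedup` are illustrative and this notebook itself does not apply the step.

```python
import pandas as pd

# Toy mapping that mimics the duplicated asparagus row
mapping = pd.DataFrame({
    "product_name": ["asparagus", "tomatoes", "asparagus"],
    "category": ["vegetables", "vegetables", "vegetables"],
})

# Keep the first occurrence of each fully identical row and renumber the index
mapping_dedup = mapping.drop_duplicates(keep="first").reset_index(drop=True)
print(mapping_dedup)
```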


In [12]:
print_dup_cols(Basket, "Basket")
print_dup_cols(Info, "Info")
print_dup_cols(Mapping, "Mapping")

In Basket, the following columns have duplicate values:
	-Column customer_id has 69701 duplicate values. This equals 87.13% of its values.
	-Column invoice_id has 251 duplicate values. This equals 0.31% of its values.
	-Column list_of_goods has 349 duplicate values. This equals 0.44% of its values.

In Info, the following columns have duplicate values:
	-Column customer_name has 557 duplicate values. This equals 1.86% of its values.
	-Column customer_gender has 29998 duplicate values. This equals 99.99% of its values.
	-Column customer_birthdate has 10 duplicate values. This equals 0.03% of its values.
	-Column kids_home has 29989 duplicate values. This equals 99.96% of its values.
	-Column teens_home has 29990 duplicate values. This equals 99.97%
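
The per-column missing-value and duplicate counts above come from our custom helpers `print_na_cols` and `print_dup_cols`. The same numbers can be obtained with plain pandas. The sketch below is only an approximation of what those helpers compute; `column_summary` and the toy data are hypothetical names introduced for illustration.

```python
import numpy as np
import pandas as pd


def column_summary(df):
    """Per-column count and share of missing and duplicated values."""
    n_rows = len(df)
    n_missing = df.isna().sum()
    # duplicated() marks every repeat after the first occurrence of a value
    n_duplicated = pd.Series({col: df[col].duplicated().sum() for col in df.columns})
    return pd.DataFrame({
        "n_missing": n_missing,
        "pct_missing": (n_missing / n_rows * 100).round(2),
        "n_duplicated": n_duplicated,
        "pct_duplicated": (n_duplicated / n_rows * 100).round(2),
    })


# Toy data standing in for Info
toy = pd.DataFrame({
    "customer_id": [1, 2, 3, 4],
    "customer_gender": ["female", "male", "female", "male"],
    "loyalty_card_number": [np.nan, 935.0, np.nan, np.nan],
})
print(column_summary(toy))
```

A high duplicate share in a column such as `customer_gender` only reflects that the column has few distinct values; it does not by itself indicate a data quality problem.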

In [13]:
print_inf_cols(Basket, "Basket")
print_inf_cols(Info, "Info")
print_inf_cols(Mapping, "Mapping")

Basket has no infinite values.

In Info, the following columns have infinite values:
	-Column typical_hour has 2 infinite values. This equals 0.01% of its values.
	-Column lifetime_spend_videogames has 226 infinite values. This equals 0.75% of its values.

Mapping has no infinite values.



Let's inspect the rows that have infinite values in the columns.

In [14]:
# print rows that have inf values
Info[Info.isin([np.inf, -np.inf]).any(axis = 1)].shape

(226, 24)
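
The per-column infinite-value counts reported earlier by `print_inf_cols` can also be reproduced with NumPy and pandas directly. This is a minimal sketch that assumes nothing about how the helper is written; `inf_counts` and the toy data are illustrative names.

```python
import numpy as np
import pandas as pd


def inf_counts(df):
    """Count +/-inf entries per numeric column, keeping only columns that have any."""
    numeric = df.select_dtypes(include="number")
    counts = np.isinf(numeric).sum()
    return counts[counts > 0]


# Toy data standing in for Info
toy = pd.DataFrame({
    "typical_hour": [10.0, np.inf, 14.0],
    "lifetime_spend_videogames": [np.inf, -np.inf, 5.0],
    "number_complaints": [0.0, 1.0, 2.0],
    "customer_name": ["a", "b", "c"],
})
print(inf_counts(toy))
```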

All 226 rows with infinite values appear to be supermarkets rather than individual customers. Let's check the number of supermarkets in the dataset.

In [15]:
# check number of supermarkets in the dataset (rows that have supermarket in the customer_name)
Info[Info.customer_name.str.contains("Supermarket")].shape

(226, 24)

In [16]:
# compare the two dataframes to see if they are the same: 
Info[Info.customer_name.str.contains("Supermarket")].equals(Info[Info.isin([np.inf, -np.inf]).any(axis = 1)])

True
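
Comparing two filtered dataframes works, but comparing the boolean masks themselves is a lighter way to check that both conditions select the same rows. The sketch below uses toy data with illustrative names; passing `na=False` is a precaution we add here in case `customer_name` ever contains missing values.

```python
import numpy as np
import pandas as pd

# Toy data standing in for Info
info = pd.DataFrame({
    "customer_name": ["Anna Smith", "Fresh Supermarket", None],
    "lifetime_spend_videogames": [12.0, np.inf, 3.0],
})

# na=False treats missing names as non-matches instead of propagating NaN
is_supermarket = info["customer_name"].str.contains("Supermarket", na=False)
has_inf = info.isin([np.inf, -np.inf]).any(axis=1)

# True only if both conditions flag exactly the same rows
same_rows = (is_supermarket == has_inf).all()
print(same_rows)
```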

This confirms that there are 226 supermarket rows in the dataset and that these are exactly the rows with infinite values. 
We will look into these customers in more detail.

In [17]:
supermarkets = Info[Info.customer_name.str.contains("Supermarket")].customer_name.unique()

In [18]:
supermarkets.shape

(173,)

## Create new variables <a class="anchor" id="chapter4"></a>
Using information in the dataset we can create new variables that might be useful for our analysis.

Using all the columns in the dataset with *lifetime_spend* in their name, we can create a new variable that represents the total amount of money each customer has spent.

In [19]:
# create new column with total amount spent by each customer by summing all lifetime_spend columns
Info['lifetime_spent'] = Info[[col for col in Info.columns if 'lifetime_spend' in col]].sum(axis=1)
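
One thing to keep in mind with this sum: we saw earlier that `lifetime_spend_videogames` contains infinite values, and a row sum that includes an infinite value is itself infinite. The sketch below shows the effect on toy data and one possible way around it, treating infinite values as missing before summing. Whether that treatment is appropriate is a modelling decision; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

# Toy data standing in for Info
info = pd.DataFrame({
    "lifetime_spend_groceries": [100.0, 250.0],
    "lifetime_spend_videogames": [20.0, np.inf],
    "lifetime_total_distinct_products": [40, 70],  # not a spend column
})

spend_cols = [col for col in info.columns if "lifetime_spend" in col]

# A plain row sum propagates the infinite value
naive_total = info[spend_cols].sum(axis=1)

# Treat inf as missing so that sum() skips it (skipna=True is the default)
clean_total = info[spend_cols].replace([np.inf, -np.inf], np.nan).sum(axis=1)

print(naive_total.tolist())
print(clean_total.tolist())
```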

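Another useful variable is the **total amount of money spent per year**, which we can obtain by dividing the total amount of money spent by the number of years each person has been a customer.

We also create a variable for the education level of each customer, which is encoded in the `customer_name` column.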

In [20]:
# create a variable with the education level of the customers: None, Bsc, Msc and PhD, encoded as 0, 1, 2 and 3 respectively.
# The information regarding this is in the customer_name column of the Info dataframe
Info['education_level'] = create_educ_level(Info)
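
`create_educ_level` lives in `funcs.py` and is not shown here. A rough sketch of the idea follows, assuming that the degree appears as a title prefix in `customer_name` (for example "Msc. John Doe"). That name format, the function name `educ_level_sketch` and the mapping dictionary are assumptions for illustration only.

```python
import pandas as pd

# ASSUMED title prefixes; the real format of customer_name is not shown in this notebook
TITLE_TO_LEVEL = {"bsc": 1, "msc": 2, "phd": 3}


def educ_level_sketch(names):
    """Map a title prefix such as 'Msc.' to 1-3; names without a known title get 0."""
    first_word = names.str.split().str[0].str.rstrip(".").str.lower()
    return first_word.map(TITLE_TO_LEVEL).fillna(0).astype(int)


toy_names = pd.Series(["Bsc. Ana Lopes", "Msc. John Doe", "Phd. Mary Poe", "Rui Costa"])
print(educ_level_sketch(toy_names).tolist())
```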

In [21]:
Info['education_level'].value_counts()

0    18568
1     3815
3     3810
2     3807
Name: education_level, dtype: int64
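
For the yearly spend described earlier in this section, a minimal sketch could look as follows. It assumes that `year_first_transaction` marks the start of the customer relationship and that a reference year is chosen by hand. The column name `spend_per_year` and the constant `REFERENCE_YEAR` are hypothetical.

```python
import pandas as pd

# Toy data standing in for Info
info = pd.DataFrame({
    "lifetime_spent": [1200.0, 300.0],
    "year_first_transaction": [2019, 2023],
})

# ASSUMPTION: fixed by hand; replace with the year the data was extracted
REFERENCE_YEAR = 2023

# Count the first year as a full year so the divisor is never zero
years_as_customer = REFERENCE_YEAR - info["year_first_transaction"] + 1
info["spend_per_year"] = info["lifetime_spent"] / years_as_customer
print(info)
```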

## Save the Data <a class="anchor" id="chapter5"></a>
This saves the current state of the data to new csv files so that we can use them in other notebooks.

In [24]:
# save all three dataframes to csv files in a new folder named treated in the data folder of the project
save_to_csv(dfs, wd + "/data/")
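
`save_to_csv` is another of our custom helpers. A minimal sketch of the same idea is shown below, assuming one file per dictionary key; the function name `save_to_csv_sketch` and the file naming scheme are assumptions, not the actual implementation.

```python
from pathlib import Path

import pandas as pd


def save_to_csv_sketch(dfs, folder):
    """Write each dataframe in `dfs` to <folder>/<key>.csv, creating the folder if needed."""
    out_dir = Path(folder)
    out_dir.mkdir(parents=True, exist_ok=True)
    for name, df in dfs.items():
        # index=False avoids writing the row index as an extra unnamed column
        df.to_csv(out_dir / f"{name}.csv", index=False)
```

Because the global variables created earlier point to the same objects as the entries of `dfs`, columns added to `Info` in this notebook are included when the dictionary is saved.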