### 4.9. Task Submission - Intro to Data Viz - part 1.1 - Google Colab

#### Directions

##### Part 1.1

Download the customer data set and add it to your “Original Data” folder.

Create a new notebook in your “Scripts” folder for part 1 of this task.

Import your analysis libraries, as well as your new customer data set as a dataframe.

Wrangle the data so that it follows consistent logic; for example, rename columns with illogical names and drop columns that don’t add anything to your analysis.

Complete the fundamental data quality and consistency checks you’ve learned throughout this Achievement; for example, check for and address missing values and duplicates, and convert any mixed-type data.

##### Part 1.2

Combine your customer data with the rest of your prepared Instacart data. (Hint: Make sure the key columns are the same data type!)

Ensure your notebook contains logical titles, section headings, and descriptive code comments.

Export this new dataframe as a pickle file so you can continue to use it in the second part of this task.

Save your notebook so that you can send it to your tutor for review after completing part 2.
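The hint about key column types in Part 1.2 can be illustrated with a minimal sketch. The two dataframes below are toy stand-ins with made-up values, not the real Instacart data, and the names `df_orders` and `df_customers` are hypothetical. The idea is to align the key's data type on both sides before merging, then use the `indicator` flag to confirm the rows matched.

```python
import pandas as pd

# Toy stand-ins for the real dataframes (values are made up for illustration).
df_orders = pd.DataFrame({"user_id": [1, 2, 2], "order_id": [10, 11, 12]})
df_customers = pd.DataFrame({"user_id": ["1", "2"], "state": ["Iowa", "Idaho"]})

# The key columns differ in type (int64 vs. strings), so align them first.
df_customers["user_id"] = df_customers["user_id"].astype("int64")

# indicator=True adds a _merge column showing where each row found a match.
df_merged = df_orders.merge(df_customers, on="user_id", how="left", indicator=True)
print(df_merged["_merge"].value_counts())
```

Whether to convert the key to integers or to strings is a design choice; what matters is that both sides agree before the merge.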

### Part 1.1

#### Importing libraries and files

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
import pandas as pd
import numpy as np
import os
import matplotlib.pyplot as plt
import seaborn as sns
import scipy

In [None]:
path1 = "/content/drive/MyDrive/Career Foundry"

In [None]:
# import customers.csv
df_cust = pd.read_csv(os.path.join(path1, 'customers.csv'), index_col = False)

In [None]:
df_cust.shape

(206209, 10)

In [None]:
df_cust.head()

| | user_id | First Name | Surnam | Gender | STATE | Age | date_joined | n_dependants | fam_status | income |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 26711 | Deborah | Esquivel | Female | Missouri | 48 | 1/1/2017 | 3 | married | 165665 |
| 1 | 33890 | Patricia | Hart | Female | New Mexico | 36 | 1/1/2017 | 0 | single | 59285 |
| 2 | 65803 | Kenneth | Farley | Male | Idaho | 35 | 1/1/2017 | 2 | married | 99568 |
| 3 | 125935 | Michelle | Hicks | Female | Iowa | 40 | 1/1/2017 | 0 | single | 42049 |
| 4 | 130797 | Ann | Gilmore | Female | Maryland | 26 | 1/1/2017 | 1 | married | 40374 |


### Data Wrangling and Consistency Checks

Collecting → Importing data from CSV, Excel, databases, APIs, etc.

Cleaning → Handling missing values, duplicates, typos, and inconsistencies.

Transforming → Converting data types, normalizing text, creating new calculated columns.

Restructuring → Merging, joining, pivoting, or reshaping data into a usable format.

Validating → Checking that the data makes sense (e.g., no negative ages, dates in correct ranges).

Enriching (optional) → Adding extra data from other sources.
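The validating step listed above can be sketched as follows. This is only an illustration on a toy dataframe with made-up values; the 0 to 120 age bounds and the name `df_demo` are assumptions, and the date format is taken from the `date_joined` values shown in the preview.

```python
import pandas as pd

# Toy data for illustration only; the last row is deliberately invalid.
df_demo = pd.DataFrame({
    "age": [48, 36, -5],
    "date_joined": ["1/1/2017", "1/2/2017", "13/45/2017"],
})

# Flag ages outside an assumed plausible range of 0 to 120.
bad_age = ~df_demo["age"].between(0, 120)

# Parse dates as month/day/year; values that cannot be parsed become NaT.
parsed = pd.to_datetime(df_demo["date_joined"], format="%m/%d/%Y", errors="coerce")
bad_date = parsed.isna()

# Show the rows that fail either check.
print(df_demo[bad_age | bad_date])
```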

### As defined in the course, these terms cover the following:

### Data Wrangling

Dropping Columns

Renaming Columns

Changing Data Types

Transposing Data

### Consistency Checks

Finding and Addressing mixed data types

Finding and Addressing missing values

Finding and Addressing duplicate values

#### renaming columns

In [None]:
# function syntax: df.rename(columns = {'old_name' : 'new_name'}, inplace = True)

df_cust.rename(columns = {
    'Surnam' : 'surname',
    'First Name' : 'first_name',
    'Gender' : 'gender',
    'STATE' : 'state',
    'Age' : 'age'
}, inplace = True)

# inplace = True modifies df_cust directly instead of returning a renamed copy.

In [None]:
df_cust.head()

| | user_id | first_name | surname | gender | state | age | date_joined | n_dependants | fam_status | income |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 26711 | Deborah | Esquivel | Female | Missouri | 48 | 1/1/2017 | 3 | married | 165665 |
| 1 | 33890 | Patricia | Hart | Female | New Mexico | 36 | 1/1/2017 | 0 | single | 59285 |
| 2 | 65803 | Kenneth | Farley | Male | Idaho | 35 | 1/1/2017 | 2 | married | 99568 |
| 3 | 125935 | Michelle | Hicks | Female | Iowa | 40 | 1/1/2017 | 0 | single | 42049 |
| 4 | 130797 | Ann | Gilmore | Female | Maryland | 26 | 1/1/2017 | 1 | married | 40374 |


#### checking datatypes

In [None]:
df_cust.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 206209 entries, 0 to 206208
Data columns (total 10 columns):
 #   Column        Non-Null Count   Dtype 
---  ------        --------------   ----- 
 0   user_id       206209 non-null  int64 
 1   first_name    194950 non-null  object
 2   surname       206209 non-null  object
 3   gender        206209 non-null  object
 4   state         206209 non-null  object
 5   age           206209 non-null  int64 
 6   date_joined   206209 non-null  object
 7   n_dependants  206209 non-null  int64 
 8   fam_status    206209 non-null  object
 9   income        206209 non-null  int64 
dtypes: int64(4), object(6)
memory usage: 15.7+ MB


#### Note:
This line also lists the data types:
```python
df_cust.dtypes
```

In [None]:
#checking whether a dataframe contains any mixed-type columns:

for col in df_cust.columns:
    weird = (df_cust[col].map(type) != type(df_cust[col].iloc[0]))
    if weird.any():
        print(col)

user_id
first_name
age
n_dependants
income
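A note on the output above: comparing every value's type against the type of the first value can flag purely numeric columns, most likely because `.map(type)` reports Python `int` while `.iloc[0]` returns a NumPy integer. That would explain why `user_id`, `age`, `n_dependants`, and `income` are listed even though `info()` shows them as `int64`. A possible alternative, sketched here on toy data with the hypothetical helper `mixed_type_columns`, is to count the distinct Python types within each column:

```python
import numpy as np
import pandas as pd

# Toy data for illustration: one clean integer column, one column with a gap.
df_demo = pd.DataFrame({
    "user_id": [26711, 33890, 65803],
    "first_name": ["Deborah", np.nan, "Kenneth"],
})

def mixed_type_columns(df):
    """Return the columns whose values have more than one Python type."""
    return [col for col in df.columns if len({type(v) for v in df[col]}) > 1]

# Only first_name is reported: it holds strings plus a float NaN.
print(mixed_type_columns(df_demo))
```

With this check, a column is only reported when its values genuinely differ in type, such as strings mixed with missing values.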


In [None]:
# Change several data types at once by passing a dictionary to astype().
# Caution: casting 'first_name' to 'str' turns its missing values into the literal string 'nan'.

df_cust = df_cust.astype({
    'user_id' : 'str',
    'first_name': 'str',
    'age': 'int64',
    'n_dependants': 'int64',
    'income' : 'int64'
})
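One side effect of this cast is worth illustrating. In many pandas versions, `astype('str')` converts `NaN` into the text `'nan'`, which would explain why `first_name` goes from 194950 to 206209 non-null entries between the two `info()` outputs in this notebook. A minimal sketch of one alternative, assuming pandas' nullable `string` dtype is acceptable for the analysis (the series `names` is toy data):

```python
import numpy as np
import pandas as pd

# Toy series with one missing name (values are illustrative).
names = pd.Series(["Deborah", np.nan, "Kenneth"])

# The nullable string dtype keeps the missing entry as <NA> instead of text.
names_safe = names.astype("string")

# The missing value is still detectable after the cast.
print(names_safe.isna().sum())
```

Keeping the gap as a true missing value means later checks with `isnull()` still find it.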

In [None]:
df_cust.describe()

| | age | n_dependants | income |
|---|---|---|---|
| count | 206209.0 | 206209.0 | 206209.0 |
| mean | 49.501646 | 1.499823 | 94632.852548 |
| std | 18.480962 | 1.118433 | 42473.786988 |
| min | 18.0 | 0.0 | 25903.0 |
| 25% | 33.0 | 0.0 | 59874.0 |
| 50% | 49.0 | 1.0 | 93547.0 |
| 75% | 66.0 | 3.0 | 124244.0 |
| max | 81.0 | 3.0 | 593901.0 |


In [None]:
df_cust.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 206209 entries, 0 to 206208
Data columns (total 10 columns):
 #   Column        Non-Null Count   Dtype 
---  ------        --------------   ----- 
 0   user_id       206209 non-null  object
 1   first_name    206209 non-null  object
 2   surname       206209 non-null  object
 3   gender        206209 non-null  object
 4   state         206209 non-null  object
 5   age           206209 non-null  int64 
 6   date_joined   206209 non-null  object
 7   n_dependants  206209 non-null  int64 
 8   fam_status    206209 non-null  object
 9   income        206209 non-null  int64 
dtypes: int64(3), object(7)
memory usage: 15.7+ MB


#### Handling missing values

In [None]:
df_cust.isnull().sum()

user_id         0
first_name      0
surname         0
gender          0
state           0
age             0
date_joined     0
n_dependants    0
fam_status      0
income          0
dtype: int64

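```python
df_cust['first_name'].value_counts(dropna = False)
```

The count above reports no missing values, but this is misleading for `first_name`: the first `info()` output showed 194950 non-null entries out of 206209, and the later cast to `str` turned those 11259 missing names into the literal string `'nan'`, which `isnull()` does not detect. To see where missing values sit (run this before the cast), filter them into their own dataframe:
```python
df_nan = df_cust[df_cust['first_name'].isnull()]
df_nan
```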

#### checking for duplicates

In [None]:
df_dups = df_cust[df_cust.duplicated()]
df_dups

| | user_id | first_name | surname | gender | state | age | date_joined | n_dependants | fam_status | income |
|---|---|---|---|---|---|---|---|---|---|---|


No duplicates were found. Good.
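`duplicated()` with no arguments only flags rows that are identical in every column. As a complementary check, the sketch below (toy data with made-up values, not the real customer file) also looks for repeated keys, on the assumption that `user_id` should be unique per customer:

```python
import pandas as pd

# Toy data: the same user_id appears twice with different income values.
df_demo = pd.DataFrame({
    "user_id": ["26711", "33890", "26711"],
    "income": [165665, 59285, 170000],
})

# Full-row duplicates: none here, because the income values differ.
full_dups = df_demo[df_demo.duplicated()]

# Key duplicates: keep=False returns every row that shares a user_id.
key_dups = df_demo[df_demo.duplicated(subset="user_id", keep=False)]
print(key_dups)
```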

In [None]:
# export the cleaned customer data as customers_clean.csv
path = os.path.join(path1, 'customers_clean.csv')
df_cust.to_csv(path, index=False)
# the export path can be built from path1 with os.path.join instead of retyping the full Drive path.

In [None]:
# loading the prepared Instacart data to merge:
ords_prods_merge = pd.read_pickle(rf'{path1}/ords_prods_merge_groups.pkl')

In [None]:
# exporting it to parquet (a compressed columnar format, usually smaller on disk and faster to load):
path2 = "/content/drive/MyDrive/Career Foundry/ords_prods_merge_groups.parquet"
ords_prods_merge.to_parquet(path2, engine="pyarrow", index=False)

### Part 1.2 follows with Kaggle.