In [1]:
#import libraries
import pandas as pd

In [2]:
#import data
filePath = "../data/shopping_data.csv"
shoppingDF = pd.read_csv(filePath)
shoppingDF.head()

,CustomerID,Card Member,Age,Annual Income,Spending Score (1-100)
0,1,Yes,19.0,15000,39.0
1,2,Yes,21.0,15000,81.0
2,3,No,20.0,16000,6.0
3,4,No,23.0,16000,77.0
4,5,No,31.0,17000,40.0


## Questions for Data Preparation
Unlike supervised learning, unsupervised learning has no outcome or target variable; instead, it is used to find patterns in the data. Preparing the data properly lets us select features that help reveal those patterns or groups.

Before we begin, consider these questions:

- What knowledge do we hope to glean from running an unsupervised learning model on this dataset? It's a shopping dataset, so we can group together shoppers based on spending habits.

- What data is available? What type? What is missing? What can be removed? First, account for the data you have. After all, you can't extract knowledge without data. We can use the DataFrame's `columns` attribute to list the column names.


- Is the data in a format that can be passed into an unsupervised learning model?
- Can I quickly hand off this data for others to use?

In [3]:
shoppingDF.columns

Index(['CustomerID', 'Card Member', 'Age', 'Annual Income',
       'Spending Score (1-100)'],
      dtype='object')

In [4]:
shoppingDF.dtypes

CustomerID                  int64
Card Member                object
Age                       float64
Annual Income               int64
Spending Score (1-100)    float64
dtype: object

In [5]:
#check for null values
for column in shoppingDF.columns:
    print(f"Column {column} has {shoppingDF[column].isnull().sum()} null values")



Column CustomerID has 0 null values
Column Card Member has 2 null values
Column Age has 2 null values
Column Annual Income has 0 null values
Column Spending Score (1-100) has 1 null values
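
The per-column loop above can also be written as one vectorized call, and the same idea can show how many rows a later `dropna()` would remove. This is a minimal sketch: `sampleDF` and its values are hypothetical stand-ins, not the real shopping data.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for shoppingDF; the values are made up for illustration.
sampleDF = pd.DataFrame({
    "CustomerID": [1, 2, 3, 4],
    "Card Member": ["Yes", None, "No", "No"],
    "Age": [19.0, 21.0, np.nan, 23.0],
})

# isnull() marks missing cells as True; sum() counts them per column.
nullCounts = sampleDF.isnull().sum()
print(nullCounts)

# Count the rows containing at least one null, i.e. the rows dropna() would remove.
rowsWithNulls = sampleDF.isnull().any(axis=1).sum()
fractionDropped = rowsWithNulls / len(sampleDF)
print(f"{rowsWithNulls} of {len(sampleDF)} rows contain a null ({fractionDropped:.0%})")
```

Checking the fraction first gives a concrete basis for deciding whether dropping rows is acceptable or whether the missing values should be filled instead.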


In [6]:
#drop rows with null values because their number is negligible
shoppingDF=shoppingDF.dropna()

In [7]:
#count duplicate entries, then drop any that exist
print(f"Duplicate entries: {shoppingDF.duplicated().sum()}")
shoppingDF = shoppingDF.drop_duplicates()

Duplicate entries: 0


In [10]:
#remove the customer id column
shoppingDF.drop(columns=["CustomerID"], inplace=True)

In [12]:
#Is the data in a format that can be passed into an unsupervised learning model?

#transform string data to numerical data
def transform_string(member):
    if member == "Yes":
        return 1
    else:
        return 0

shoppingDF["Card Member"] = shoppingDF["Card Member"].apply(transform_string)
shoppingDF.head()

,Card Member,Age,Annual Income,Spending Score (1-100)
0,1,19.0,15000,39.0
1,1,21.0,15000,81.0
2,0,20.0,16000,6.0
3,0,23.0,16000,77.0
4,0,31.0,17000,40.0
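
Note that `transform_string` sends every value other than "Yes" to 0, which would silently include typos or differently cased entries. As an alternative sketch (not the method used in this notebook), an explicit dictionary passed to `Series.map` leaves unrecognized values as NaN so they can be inspected. The sample values below are hypothetical.

```python
import pandas as pd

# Hypothetical values, including one unexpected entry ("yes" in lowercase).
cardMember = pd.Series(["Yes", "No", "yes", "No"])

# Values missing from the dictionary become NaN instead of defaulting to 0.
encoded = cardMember.map({"Yes": 1, "No": 0})

# Collect the raw values that the mapping did not recognize.
unexpected = cardMember[encoded.isnull()].unique()
print(encoded.tolist())
print(list(unexpected))
```

If `unexpected` is empty, the explicit mapping and the original function produce the same encoding.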


In [16]:
#convert column names to not have spaces
shoppingDF.rename(
    columns={
        "Card Member": "CardMember",
        "Annual Income": "AnnualIncome",
        "Spending Score (1-100)": "SpendingScore(1-100)",
    },
    inplace=True,
)
shoppingDF.head()

,CardMember,Age,AnnualIncome,SpendingScore(1-100)
0,1,19.0,15000,39.0
1,1,21.0,15000,81.0
2,0,20.0,16000,6.0
3,0,23.0,16000,77.0
4,0,31.0,17000,40.0
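
When there are many columns, listing each rename by hand gets tedious. One possible alternative, sketched here on a hypothetical empty DataFrame with the same column names, is to strip the spaces from every name at once with the `str.replace` accessor on the column index.

```python
import pandas as pd

# Hypothetical empty frame that only carries the original column names.
sampleDF = pd.DataFrame(
    columns=["Card Member", "Age", "Annual Income", "Spending Score (1-100)"]
)

# Remove every literal space from each column name.
sampleDF.columns = sampleDF.columns.str.replace(" ", "", regex=False)
print(list(sampleDF.columns))
```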


In [18]:
#save cleaned data as a JSON file
filePath = "../data/cleaned_shopping_data.json"
shoppingDF.to_json(filePath)
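
To check that the saved file can be handed off to others, it can be read back with `pd.read_json` and compared with the original. This is a minimal sketch under stated assumptions: `cleanedDF` is a small hypothetical stand-in for the cleaned data, and a temporary directory replaces the real `../data` path.

```python
import os
import tempfile

import pandas as pd

# Hypothetical stand-in for the cleaned shoppingDF.
cleanedDF = pd.DataFrame({
    "CardMember": [1, 1, 0],
    "Age": [19.0, 21.0, 20.0],
    "AnnualIncome": [15000, 15000, 16000],
})

with tempfile.TemporaryDirectory() as tmpDir:
    jsonPath = os.path.join(tmpDir, "cleaned_sample.json")
    cleanedDF.to_json(jsonPath)          # same call as in the cell above
    reloadedDF = pd.read_json(jsonPath)  # read the file back while it still exists

# Compare column names and values; dtypes may change in a round trip,
# so the values are compared rather than the dtypes.
sameColumns = list(reloadedDF.columns) == list(cleanedDF.columns)
sameValues = bool((reloadedDF.to_numpy() == cleanedDF.to_numpy()).all())
print(f"Columns match: {sameColumns}, values match: {sameValues}")
```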