In [7]:
# import dependencies
import pandas as pd

In [8]:
# load dataset into a DataFrame
file_path = "shopping_data.csv"
df_shopping = pd.read_csv(file_path)
df_shopping.head()

Unnamed: 0,CustomerID,Card Member,Age,Annual Income,Spending Score (1-100)
0,1,Yes,19.0,15000,39.0
1,2,Yes,21.0,15000,81.0
2,3,No,20.0,16000,6.0
3,4,No,23.0,16000,77.0
4,5,No,31.0,17000,40.0


# Questions for Data Preparation

Unlike supervised learning, unsupervised learning has no clear outcome or target variable; instead, it is used to find patterns in the data. By properly preparing the data, we can select features that help us find those patterns or groups.

Before we begin, consider these questions:

- What knowledge do we hope to glean from running an unsupervised learning model on this dataset?
- What data is available? What type? What is missing? What can be removed?
- Is the data in a format that can be passed into an unsupervised learning model?
- Can I quickly hand off this data for others to use?

# Part 1 Data Selection

## What knowledge do we hope to glean from running an unsupervised learning model on this dataset?
Group shoppers together based on their spending habits.

In [9]:
# What data is available? Check out the columns
df_shopping.columns

Index(['CustomerID', 'Card Member', 'Age', 'Annual Income',
       'Spending Score (1-100)'],
      dtype='object')

In [10]:
# What Type of data is available? 
# use dtypes to list DF types
df_shopping.dtypes


CustomerID                  int64
Card Member                object
Age                       float64
Annual Income               int64
Spending Score (1-100)    float64
dtype: object

In [14]:
# What data is missing? Check for null values, since results will not be accurate if large chunks are missing
# use isnull() to check, looping through the columns
for column in df_shopping.columns:
    print(f"Column {column} has {df_shopping[column].isnull().sum()} null values")


Column CustomerID has 0 null values
Column Card Member has 2 null values
Column Age has 2 null values
Column Annual Income has 0 null values
Column Spending Score (1-100) has 1 null values
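
The same per-column counts can be obtained without a loop. The following is a minimal sketch, assuming a small made-up stand-in frame (`sample` is hypothetical and is not the shopping data); it also counts how many rows a later `dropna()` would remove.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df_shopping; the values are illustrative only
sample = pd.DataFrame({
    "Card Member": ["Yes", None, "No", "Yes"],
    "Age": [19.0, 21.0, np.nan, np.nan],
    "Annual Income": [15000, 15000, 16000, 16000],
})

# isnull() marks missing cells; sum() counts them per column
null_counts = sample.isnull().sum()
print(null_counts)

# Number of rows that contain at least one null (the rows dropna() would remove)
rows_with_nulls = sample.isnull().any(axis=1).sum()
print(f"{rows_with_nulls} of {len(sample)} rows contain at least one null")
```

Comparing the second number with the total row count is one way to judge whether dropping rows is acceptable or whether too much data would be lost.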


In [15]:
# What data can/should be removed?
# drop the rows containing nulls, since only a few rows are affected
df_shopping = df_shopping.dropna()

In [17]:
# check for duplicates
print(f"Duplicate entries: {df_shopping.duplicated().sum()}")

Duplicate entries: 0
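
No duplicates were found here, so nothing had to be removed. Had there been any, a sketch along the following lines could drop them; the `sample` frame is a made-up stand-in, not the shopping data.

```python
import pandas as pd

# Hypothetical stand-in; the third row repeats the first
sample = pd.DataFrame({
    "Age": [19.0, 21.0, 19.0],
    "Annual Income": [15000, 15000, 15000],
})

# duplicated() flags every repeat of an earlier row
duplicate_count = sample.duplicated().sum()
print(f"Duplicate entries: {duplicate_count}")

# drop_duplicates() keeps the first occurrence of each row by default
deduped = sample.drop_duplicates().reset_index(drop=True)
print(f"Rows after removing duplicates: {len(deduped)}")
```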


In [18]:
# remove columns that tell us nothing about shopping behavior
# CustomerID is only an identifier, so drop it
df_shopping.drop(columns=['CustomerID'], inplace=True)
df_shopping.head()

Unnamed: 0,Card Member,Age,Annual Income,Spending Score (1-100)
0,Yes,19.0,15000,39.0
1,Yes,21.0,15000,81.0
2,No,20.0,16000,6.0
3,No,23.0,16000,77.0
4,No,31.0,17000,40.0


# Part 2 Data Processing
## Is the data in a format that can be passed into an unsupervised learning model?
After exploring the data to see what kind of insights and analysis you might glean, make sure it is set up for an unsupervised learning model:
- null values are handled
- only numerical data is stored
- values are scaled, i.e. adjusted so that differences in magnitude between columns won't skew the results
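
The three conditions above can be checked in code. The following is a minimal sketch on a made-up stand-in frame (`sample` is hypothetical). The scaling shown is z-score standardization, which is one common option and is an assumption here; this notebook itself only divides Annual Income by 1,000.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Hypothetical stand-in for the prepared data; values are illustrative only
sample = pd.DataFrame({
    "CardMember": [1, 1, 0, 0],
    "Age": [19.0, 21.0, 20.0, 23.0],
    "AnnualIncome": [15.0, 15.0, 16.0, 16.0],
})

# 1. Null values are handled: no missing cell anywhere
no_nulls = not sample.isnull().values.any()

# 2. Only numerical data is stored: every column has a numeric dtype
all_numeric = all(is_numeric_dtype(dtype) for dtype in sample.dtypes)

print(f"No nulls: {no_nulls}, all numeric: {all_numeric}")

# 3. Values are scaled: z-score standardization (assumed technique) gives
#    each column a mean of 0 and a population standard deviation of 1
scaled = (sample - sample.mean()) / sample.std(ddof=0)
print(scaled.round(2))
```

Standardizing puts every column on a comparable scale, so a column with large raw values does not dominate distance-based methods.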

In [20]:
# Transform strings to integers
# change Card Member from Yes/No to 1/0 with a function that converts "Yes" to 1 and anything else to 0
def change_string(member):
    if member == "Yes":
        return 1
    else:
        return 0

df_shopping['Card Member'] = df_shopping['Card Member'].apply(change_string)
df_shopping.head()

Unnamed: 0,Card Member,Age,Annual Income,Spending Score (1-100)
0,1,19.0,15000,39.0
1,1,21.0,15000,81.0
2,0,20.0,16000,6.0
3,0,23.0,16000,77.0
4,0,31.0,17000,40.0
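
Note that `change_string` silently turns any value other than "Yes" into 0, including typos. The sketch below, on a made-up `sample` frame (hypothetical, not the shopping data), shows a vectorized equivalent and an explicit mapping that surfaces unexpected values instead of hiding them.

```python
import pandas as pd

# Hypothetical stand-in; "Maybe" represents an unexpected entry
sample = pd.DataFrame({"Card Member": ["Yes", "No", "Yes", "Maybe"]})

# Vectorized equivalent of change_string: "Yes" -> 1, anything else -> 0
as_flag = sample["Card Member"].eq("Yes").astype(int)
print(as_flag.tolist())

# Explicit mapping: values missing from the dict become NaN,
# so unexpected entries can be inspected before encoding
mapped = sample["Card Member"].map({"Yes": 1, "No": 0})
unexpected = sample.loc[mapped.isnull(), "Card Member"].tolist()
print(f"Unexpected values: {unexpected}")
```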


In [21]:
# rescale Annual Income to thousands by dividing by 1,000
df_shopping['Annual Income'] = df_shopping['Annual Income'] / 1000
df_shopping.head()

Unnamed: 0,Card Member,Age,Annual Income,Spending Score (1-100)
0,1,19.0,15.0,39.0
1,1,21.0,15.0,81.0
2,0,20.0,16.0,6.0
3,0,23.0,16.0,77.0
4,0,31.0,17.0,40.0


In [25]:
# Rename columns so the names contain no spaces, numbers, or symbols
df_shopping = df_shopping.rename(columns={"Card Member": "CardMember", "Annual Income": "AnnualIncome", "Spending Score (1-100)": "SpendingScore"})
df_shopping.head()

Unnamed: 0,CardMember,Age,AnnualIncome,SpendingScore
0,1,19.0,15.0,39.0
1,1,21.0,15.0,81.0
2,0,20.0,16.0,6.0
3,0,23.0,16.0,77.0
4,0,31.0,17.0,40.0


# Part 3 Data Transformation
## Is the data quickly transferable to other users?
The data now needs to be converted into a format that others can use easily. Not everyone works with DataFrames, so save the final product in a common file type such as CSV or Excel.

In [26]:
# Saving cleaned data
file_path = "shopping_data_cleaned.csv"
df_shopping.to_csv(file_path, index=False)
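
To see why `index=False` matters, the following sketch round-trips a made-up `sample` frame (hypothetical) through CSV. It writes to an in-memory buffer rather than a file so that it can be run without touching the disk.

```python
import io

import pandas as pd

# Hypothetical stand-in for the cleaned data; values are illustrative only
sample = pd.DataFrame({
    "CardMember": [1, 0],
    "Age": [19.0, 20.0],
    "AnnualIncome": [15.0, 16.0],
    "SpendingScore": [39.0, 6.0],
})

# Write without the index, then read back and compare with the original
buffer = io.StringIO()
sample.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)
print(f"Round trip preserved the data: {reloaded.equals(sample)}")

# Writing with the index adds an unnamed first column on reload
buffer_with_index = io.StringIO()
sample.to_csv(buffer_with_index)
buffer_with_index.seek(0)
reloaded_with_index = pd.read_csv(buffer_with_index)
print(reloaded_with_index.columns.tolist())
```

Omitting the index keeps the saved file limited to the feature columns, so whoever loads it next does not have to drop an extra column first.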