In [1]:
import pandas as pd

In [3]:
file_path = "Resources/shopping_data.csv"
df_shopping = pd.read_csv(file_path, encoding="ISO-8859-1")
df_shopping.head()

Unnamed: 0,CustomerID,Card Member,Age,Annual Income,Spending Score (1-100)
0,1,Yes,19.0,15000,39.0
1,2,Yes,21.0,15000,81.0
2,3,No,20.0,16000,6.0
3,4,No,23.0,16000,77.0
4,5,No,31.0,17000,40.0


## Questions for Data Preparation
Unsupervised learning doesn't have a clear outcome or target variable like supervised learning, but it is used to find patterns. By properly preparing the data, we can select features that help us find patterns or groups.

Before we begin, consider these questions:

    1. What knowledge do we hope to glean from running an unsupervised learning model on this dataset?
    2. What data is available? What type? What is missing? What can be removed?
    3. Is the data in a format that can be passed into an unsupervised learning model?
    4. Can I quickly hand off this data for others to use?
    
#### What knowledge do we hope to glean from running an unsupervised learning model on this dataset?
It's a shopping dataset, so we can group together shoppers based on spending habits.

In [4]:
#What data is available?
#Check what data is available by looking at the columns
df_shopping.columns

Index(['CustomerID', 'Card Member', 'Age', 'Annual Income',
       'Spending Score (1-100)'],
      dtype='object')

In [5]:
#Now that we know what data we have, we can start thinking about possible analysis.
#For example, data points for features like Age and Annual Income might appear in our end result as groupings or clusters. 
#However, there are no data points for items purchased, so our algorithms cannot discover related patterns.

In [6]:
#What type of data is available?
#check column data types
df_shopping.dtypes

CustomerID                  int64
Card Member                object
Age                       float64
Annual Income               int64
Spending Score (1-100)    float64
dtype: object

In [7]:
#From the data types above, we can see that Card Member has the object (string) data type. ML models need numeric data,
#so we will have to either remove this column or transform it.
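Rather than reading the dtype listing by eye, the non-numeric columns can be picked out programmatically. The following is a minimal sketch, assuming a small hypothetical `sample` DataFrame that mirrors the first rows of the shopping data; `select_dtypes` is a standard pandas method.

```python
import pandas as pd

# Hypothetical sample mirroring the first rows of the shopping data.
sample = pd.DataFrame({
    "CustomerID": [1, 2, 3],
    "Card Member": ["Yes", "Yes", "No"],
    "Age": [19.0, 21.0, 20.0],
    "Annual Income": [15000, 15000, 16000],
    "Spending Score (1-100)": [39.0, 81.0, 6.0],
})

# Columns that are not numeric must be dropped or encoded before modeling.
non_numeric_columns = sample.select_dtypes(exclude="number").columns.tolist()
print(non_numeric_columns)
```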


In [8]:
#What data is missing?
#check for null or 0
#If you initially had hoped to produce an outcome using a type of data, 
#but it turned out more than 80% of those rows are empty, then the results won't be very accurate!
#For example, return to our Age and Income groups: If it turns out there are 1,200 rows without any Age data points,
#then we clearly can't use that column in our model. There is no set cutoff for missing data; that decision is left
#up to you, the analyst, and must be based on your understanding of the business needs.

In [10]:
#find null values
for column in df_shopping.columns:
    print(f"Column {column} has {df_shopping[column].isnull().sum()} null values")

Column CustomerID has 0 null values
Column Card Member has 2 null values
Column Age has 2 null values
Column Annual Income has 0 null values
Column Spending Score (1-100) has 1 null values
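Raw null counts are easier to judge as a share of all rows. As a minimal sketch, assuming a small hypothetical `sample` DataFrame rather than the real file, the mean of the boolean null mask gives the fraction of missing values per column:

```python
import pandas as pd

# Hypothetical sample with one missing Age value out of four rows.
sample = pd.DataFrame({
    "Age": [19.0, float("nan"), 20.0, 23.0],
    "Annual Income": [15000, 15000, 16000, 16000],
})

# isnull() yields True/False per cell; the column mean is the missing share.
missing_share = sample.isnull().mean()
print(missing_share)
```

Comparing these fractions against whatever cutoff the business needs suggest helps decide between dropping rows and dropping a whole column.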


In [11]:
#There will be a few rows with missing values that we'll need to handle.
#The judgement call will be to either remove these rows or decide that the dataset is not suitable for our model. 
#In this case, we'll proceed with handling these values because they are a small percentage of the overall data.

In [12]:
#What data can be removed?
#You have begun to explore the data and have taken a look at null values.
#Next, determine if the data can be removed. Consider: Are there string columns that we can't use? Are there columns with excessive null data points? Was our decision to handle missing values to just remove them?

#In our example, the only string column is Card Member, which we can transform rather than remove,
#and only a few rows have null data points, not enough to justify removing a whole column.

#Rows of data with null values can be removed with the dropna() method 

In [17]:
# Drop null rows
df_shopping = df_shopping.dropna()
df_shopping.head()

Unnamed: 0,CustomerID,Card Member,Age,Annual Income,Spending Score (1-100)
0,1,Yes,19.0,15000,39.0
1,2,Yes,21.0,15000,81.0
2,3,No,20.0,16000,6.0
3,4,No,23.0,16000,77.0
4,5,No,31.0,17000,40.0


In [15]:
#Duplicates should also be removed: they don't give us any new information, and they could skew our results.
#Use duplicated().sum() to count duplicates
print(f"Duplicated entries: {df_shopping.duplicated().sum()}")

Duplicated entries: 0
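This dataset has no duplicates, so nothing needs to be removed here. For a dataset that does, a minimal sketch using a hypothetical `sample` DataFrame with one repeated row could look like this:

```python
import pandas as pd

# Hypothetical sample in which the last row repeats the second one.
sample = pd.DataFrame({
    "Age": [19.0, 21.0, 21.0],
    "Annual Income": [15000, 15000, 15000],
})

print(f"Duplicated entries: {sample.duplicated().sum()}")

# Keep the first occurrence of each row and renumber the index.
deduplicated = sample.drop_duplicates().reset_index(drop=True)
print(deduplicated)
```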


In [18]:
#We can also remove data that doesn't tell us anything interesting.
#In this data, CustomerID doesn't tell us anything about shopping behavior,
#so we can remove the column.
df_shopping.drop(columns=["CustomerID"], inplace=True)
df_shopping.head()




Unnamed: 0,Card Member,Age,Annual Income,Spending Score (1-100)
0,Yes,19.0,15000,39.0
1,Yes,21.0,15000,81.0
2,No,20.0,16000,6.0
3,No,23.0,16000,77.0
4,No,31.0,17000,40.0


### Data processing or transformation to make it ready for an unsupervised model
The next step is to move on from what you (the user) want to get out of your data and on to what the unsupervised model needs out of the data.

Recall that in the data selection step, you, as the user, are exploring the data to see what kind of insights and analysis you might glean. You reviewed the columns available and the data types stored, and determined if there were missing values.

For data processing, the focus is on making sure the data is set up for the unsupervised learning model, which requires the following:

    1. Null values are handled.
    2. Only numerical data is used.
    3. Values are scaled. In other words, data has been manipulated to ensure that the variance between the numbers won't skew results.
    
**Note:** Recall that when features have different scales, they can have a disproportionate impact on the model. Unscaled values can also lead to messy graphs. Therefore, it is important to understand when to scale and normalize data. For example, if four columns of data are single digits and the fifth column is in the millions, we would need to scale the fifth column to align with the other four.
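This notebook does not carry out the scaling step itself. As a minimal sketch, assuming standardization (zero mean, unit variance) is the kind of scaling wanted and using a hypothetical `sample` DataFrame, it can be done with pandas alone. scikit-learn's `StandardScaler` is a common alternative that applies the same formula.

```python
import pandas as pd

# Hypothetical sample: Age is in the tens, Annual Income in the thousands.
sample = pd.DataFrame({
    "Age": [19.0, 21.0, 20.0, 23.0, 31.0],
    "Annual Income": [15000, 15000, 16000, 16000, 17000],
})

# Subtract each column's mean and divide by its population standard
# deviation (ddof=0), so every column ends up on a comparable scale.
scaled = (sample - sample.mean()) / sample.std(ddof=0)
print(scaled.round(2))
```

After this step, each column has a mean of roughly 0 and a standard deviation of 1, so Annual Income no longer dominates Age simply because its raw numbers are larger.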

In [19]:
#Is the data in a format that can be passed into an unsupervised learning model?
#We saw before that all our data had the correct type for each column; however, 
#we know that our model can't have strings passed into it.

#To make sure we can use our string data, we'll transform our strings of Yes and No from the Card Member column to 1 and 0, respectively, by creating a function that will convert Yes to a 1 and anything else to 0.

#The function will then be run on the whole column with the .apply method.

In [21]:
#Transforming the string column using a user-defined function
def change_string(member):
    # Map "Yes" to 1; any other value becomes 0
    if member == "Yes":
        return 1
    else:
        return 0

df_shopping["Card Member"] = df_shopping["Card Member"].apply(change_string)
df_shopping.head()
    

Unnamed: 0,Card Member,Age,Annual Income,Spending Score (1-100)
0,1,19.0,15000,39.0
1,1,21.0,15000,81.0
2,0,20.0,16000,6.0
3,0,23.0,16000,77.0
4,0,31.0,17000,40.0
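The same Yes/No encoding can also be written without a helper function. This is a sketch of one alternative, not the approach used above, applied to a hypothetical `card_member` Series: comparing the column to "Yes" gives booleans, and casting them to integers gives 1 for Yes and 0 for anything else.

```python
import pandas as pd

# Hypothetical Series standing in for the Card Member column.
card_member = pd.Series(["Yes", "Yes", "No", "No", "No"], name="Card Member")

# True where the value is "Yes", False otherwise; astype(int) turns that into 1/0.
encoded = (card_member == "Yes").astype(int)
print(encoded.tolist())
```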


In [24]:
#Skill drill: reformat the names of the columns so they contain no spaces or numbers.
df_shopping = df_shopping.rename(columns={
    "Card Member": "Card_Member",
    "Annual Income": "Annual_Income",
    "Spending Score (1-100)": "Spending_Score",
})
df_shopping.head()

Unnamed: 0,Card_Member,Age,Annual_Income,Spending_Score
0,1,19.0,15000,39.0
1,1,21.0,15000,81.0
2,0,20.0,16000,6.0
3,0,23.0,16000,77.0
4,0,31.0,17000,40.0
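With only a few columns, an explicit rename mapping is the clearest option. For wider tables, the names could be cleaned by rule instead. The sketch below assumes a hypothetical helper, `clean_name`, that drops any parenthetical suffix and replaces spaces with underscores; it is an illustration and not part of the original notebook.

```python
import re

import pandas as pd

def clean_name(name):
    """Hypothetical helper: remove a '(...)' suffix and replace spaces."""
    name = re.sub(r"\s*\(.*?\)", "", name)  # drop e.g. " (1-100)"
    return name.strip().replace(" ", "_")

# Hypothetical sample with the original column names.
sample = pd.DataFrame(
    [[1, 19.0, 15000, 39.0]],
    columns=["Card Member", "Age", "Annual Income", "Spending Score (1-100)"],
)

# rename accepts a function that is applied to every column name.
sample = sample.rename(columns=clean_name)
cleaned_names = sample.columns.tolist()
print(cleaned_names)
```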


Data transformation involves thinking about the future. More often than not, new data will keep arriving in your data storage (a place where raw data is kept before being touched), and many people will be working on different types of data analysis. We want to make sure that whoever wants to use the data in the future can do so.

Let's return once more to our list of questions.

### Can I quickly hand off this data for others to use?
The data now needs to be transformed back into a more user-friendly format. It would be nice if everyone were as comfortable with DataFrames as you are; unfortunately, that is not the case. You'll want to convert the final product into a common file format such as CSV or Excel.

Now that our data has been cleaned and processed, it is ready to be converted to a readable format for future use:

In [25]:
# Saving cleaned data
file_path = "Output/shopping_data_cleaned.csv"
df_shopping.to_csv(file_path, index=False)
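`to_csv` does not create missing folders, so the `Output` directory must exist before the file is saved. As a minimal sketch, assuming a hypothetical `cleaned` DataFrame and a temporary directory in place of the real project folder, the directory can be created first and the file read back to confirm that it round-trips:

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical cleaned data with the renamed columns.
cleaned = pd.DataFrame({
    "Card_Member": [1, 1, 0],
    "Age": [19.0, 21.0, 20.0],
    "Annual_Income": [15000, 15000, 16000],
    "Spending_Score": [39.0, 81.0, 6.0],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    output_path = Path(tmp_dir) / "Output" / "shopping_data_cleaned.csv"
    # Create the Output folder if it is missing.
    output_path.parent.mkdir(parents=True, exist_ok=True)
    cleaned.to_csv(output_path, index=False)
    # Read the file back to check what a colleague would receive.
    reloaded = pd.read_csv(output_path)

print(reloaded.head())
```

Passing `index=False` keeps the row index out of the file, so no extra unnamed column appears when the CSV is loaded again.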