# Students Do: Understanding customers

## Instructions

You are given a dataset containing historical purchase data from an online store for 200 customers. In this activity, you will put your data preprocessing skills into practice and pick up a few new ones needed to start finding customer clusters.

In [9]:
# Initial imports
import pandas as pd
from pathlib import Path

In [10]:
# Data loading
file_path = Path("../Unsupervised-Learning-Crypto-Currencies/crypto_data.csv")
df_crypto = pd.read_csv(file_path)
print(df_crypto)

     Unnamed: 0        CoinName Algorithm  IsTrading ProofType  \
0            42         42 Coin    Scrypt       True   PoW/PoS   
1           365         365Coin       X11       True   PoW/PoS   
2           404         404Coin    Scrypt       True   PoW/PoS   
3           611       SixEleven   SHA-256       True       PoW   
4           808             808   SHA-256       True   PoW/PoS   
...         ...             ...       ...        ...       ...   
1247        XBC     BitcoinPlus    Scrypt       True       PoS   
1248       DVTC      DivotyCoin    Scrypt      False   PoW/PoS   
1249       GIOT     Giotto Coin    Scrypt      False   PoW/PoS   
1250       OPSC  OpenSourceCoin   SHA-256      False   PoW/PoS   
1251       PUNK       SteamPunk       PoS      False       PoS   

      TotalCoinsMined TotalCoinSupply  
0        4.199995e+01              42  
1                 NaN      2300000000  
2        1.055185e+09       532000000  
3                 NaN          611000  
4      

List the DataFrame's data types to confirm that each one matches the kind of data stored in its column.

In [11]:
# List dataframe data types
df_crypto.dtypes


Unnamed: 0          object
CoinName            object
Algorithm           object
IsTrading             bool
ProofType           object
TotalCoinsMined    float64
TotalCoinSupply     object
dtype: object
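
The listing above shows `TotalCoinSupply` stored as `object` even though it holds numbers. Here is a minimal sketch of one way to convert such a column. It uses a small made-up frame (the values are illustrative, not taken from the dataset) and assumes that entries which cannot be parsed should become `NaN`:

```python
import pandas as pd

# Toy stand-in for the real data; the values are illustrative only.
toy = pd.DataFrame({"TotalCoinSupply": ["42", "2300000000", "not a number"]})

# errors="coerce" turns entries that cannot be parsed into NaN instead of raising.
toy["TotalCoinSupply"] = pd.to_numeric(toy["TotalCoinSupply"], errors="coerce")

print(toy.dtypes)
```

Coerced entries then show up in the null-value check, so they can be handled together with the other missing values.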

In [12]:
# Keep only the rows where IsTrading is True
df_crypto = df_crypto[df_crypto["IsTrading"]]
print(df_crypto)

     Unnamed: 0     CoinName    Algorithm  IsTrading ProofType  \
0            42      42 Coin       Scrypt       True   PoW/PoS   
1           365      365Coin          X11       True   PoW/PoS   
2           404      404Coin       Scrypt       True   PoW/PoS   
3           611    SixEleven      SHA-256       True       PoW   
4           808          808      SHA-256       True   PoW/PoS   
...         ...          ...          ...        ...       ...   
1243       SERO   Super Zero       Ethash       True       PoW   
1244        UOS          UOS      SHA-256       True      DPoI   
1245        BDX       Beldex  CryptoNight       True       PoW   
1246        ZEN      Horizen     Equihash       True       PoW   
1247        XBC  BitcoinPlus       Scrypt       True       PoS   

      TotalCoinsMined TotalCoinSupply  
0        4.199995e+01              42  
1                 NaN      2300000000  
2        1.055185e+09       532000000  
3                 NaN          611000  
4      

In [13]:
# Remove the IsTrading Column
df_crypto = df_crypto.drop(columns=["IsTrading"])
df_crypto.head()

  Unnamed: 0   CoinName Algorithm ProofType  TotalCoinsMined TotalCoinSupply
0         42    42 Coin    Scrypt   PoW/PoS         41.99995              42
1        365    365Coin       X11   PoW/PoS              NaN      2300000000
2        404    404Coin    Scrypt   PoW/PoS     1055185000.0       532000000
3        611  SixEleven   SHA-256       PoW              NaN          611000
4        808        808   SHA-256   PoW/PoS              0.0               0


In [14]:
# Find null values
for column in df_crypto.columns:
    print(f"Column {column} has {df_crypto[column].isnull().sum()} null values")



Column Unnamed: 0 has 0 null values
Column CoinName has 0 null values
Column Algorithm has 0 null values
Column ProofType has 0 null values
Column TotalCoinsMined has 459 null values
Column TotalCoinSupply has 0 null values
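
The same counts can be obtained without an explicit loop. The sketch below runs on a toy frame with made-up values. It also shows one possible way to discard rows with missing values, assuming that dropping them is acceptable for the analysis:

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real data; the values are illustrative only.
toy = pd.DataFrame({
    "CoinName": ["A", "B", "C"],
    "TotalCoinsMined": [41.99995, np.nan, 0.0],
})

# isnull() marks missing cells; sum() counts them per column.
null_counts = toy.isnull().sum()
print(null_counts)

# Drop only the rows that are missing a value in the chosen column.
cleaned = toy.dropna(subset=["TotalCoinsMined"])
print(cleaned)
```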


Check for duplicate entries and remove any that you find.

In [None]:
# Find duplicate entries
print(f"Duplicate entries: {df_shopping.duplicated().sum()}")
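
The cell above only counts duplicates. If the count is greater than zero, the rows can be removed with `drop_duplicates()`. This is a minimal sketch on a toy frame; the column names mirror the activity but the values are made up:

```python
import pandas as pd

# Toy stand-in for the shopping data; the second row repeats the first.
toy = pd.DataFrame({
    "Age": [25, 25, 40],
    "Annual Income": [30000, 30000, 52000],
})

n_dupes = toy.duplicated().sum()
print(f"Duplicate entries: {n_dupes}")

# Keep the first occurrence of each row and rebuild a clean 0..n-1 index.
deduped = toy.drop_duplicates().reset_index(drop=True)
print(deduped)
```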


Unsupervised learning algorithms require all features to be numeric and on similar scales. Perform the following data transformations.

* The `Previous Shopper` column contains categorical data. Categorical variables should be transformed into numerical values before clustering; here, mapping `Yes` to `1` and `No` to `0` is a reasonable choice.

In [None]:
# Transform the Previous Shopper column
def changeStatus(status):
    # "Yes" becomes 1; any other value becomes 0
    if status == "Yes":
        return 1
    else:
        return 0

# Along with replace() and map(), apply() is another way to encode the Previous Shopper column as numbers.
df_shopping["Previous Shopper"] = df_shopping["Previous Shopper"].apply(changeStatus)
df_shopping.head()
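
For comparison, here is a sketch of the `map()` alternative mentioned in the comment above. It runs on a toy column with made-up values and assumes the only labels are `Yes` and `No`:

```python
import pandas as pd

# Toy stand-in for the Previous Shopper column; the values are illustrative only.
toy = pd.DataFrame({"Previous Shopper": ["Yes", "No", "Yes"]})

# map() looks each label up in the dictionary.
toy["Previous Shopper"] = toy["Previous Shopper"].map({"Yes": 1, "No": 0})

print(toy)
```

One difference worth knowing: `map()` turns any label missing from the dictionary into `NaN`, whereas `changeStatus` silently turns every value other than `Yes` into `0`. The `map()` version therefore makes unexpected labels easier to spot.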


* Next, scale the `Age`, `Annual Income` and `Spending Score (1-100)` columns so that they are on a scale comparable to the `Previous Shopper` column. `StandardScaler` standardizes each column to zero mean and unit variance.

In [None]:
# Scale the data
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaled_data = scaler.fit_transform(df_shopping[['Age', 'Annual Income', 'Spending Score (1-100)']])
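
To see what `StandardScaler` does to each column, the sketch below compares its output with the formula `(x - mean) / std` computed by hand. The numbers are made up for illustration:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data: two columns on very different scales (illustrative values only).
toy = np.array([
    [20.0, 15000.0],
    [35.0, 40000.0],
    [50.0, 65000.0],
])

scaled = StandardScaler().fit_transform(toy)

# StandardScaler uses the population standard deviation (ddof=0),
# which is also NumPy's default for std().
manual = (toy - toy.mean(axis=0)) / toy.std(axis=0)

print(np.allclose(scaled, manual))
```

After scaling, every column has mean 0 and standard deviation 1, so no single feature dominates a distance-based clustering algorithm simply because of its units.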

In [None]:
# A list of the columns from the original DataFrame
df_shopping.columns

In [None]:
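# Create a DataFrame with the transformed data.
# Name the columns explicitly, in the order passed to the scaler, and reuse
# the original index so the Previous Shopper values line up row by row.
scaled_columns = ['Age', 'Annual Income', 'Spending Score (1-100)']
new_df_shopping = pd.DataFrame(scaled_data, columns=scaled_columns, index=df_shopping.index)
new_df_shopping['Previous Shopper'] = df_shopping['Previous Shopper']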
new_df_shopping.head()

In [None]:
# Rename the spending score column
new_df_shopping = new_df_shopping.rename(columns={'Spending Score (1-100)': 'Spending Score'})
new_df_shopping.head()

Save the cleaned DataFrame as a `CSV` file named `shopping_data_cleaned.csv`.

In [None]:
# Saving cleaned data
file_path = Path("../Resources/shopping_data_cleaned.csv")
new_df_shopping.to_csv(file_path, index=False)
