## 18.2.1
### Steps for Preparing Data
After digging into unsupervised learning a bit, you realize that your first step in convincing *Accountability Accountants* to invest in cryptocurrency is to preprocess the data.

You and Martha open up the dataset to get started preprocessing it. Together, you will want to manage unnecessary columns, rows with null values, and mixed data types before turning your algorithm loose.

Before moving data to our unsupervised algorithms, complete the following steps for preparing data:

   - Data selection
   - Data processing
   - Data transformation

#### Data Selection

Data selection entails making good choices about which data will be used. Consider what data is available, what data is missing, and what data can be removed. For example, say we have a dataset on city weather that consists of temperature, population, latitude and longitude, date, snowfall, and income. After looking through the columns, we can readily see that population and income data don't affect weather. We might also notice some rows are missing temperature data. In the data selection process, we would remove the population and income columns as well as any rows that don't record temperatures.
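
As a minimal sketch of the selection step described above, assuming a small made-up weather table (the column names and values are hypothetical, not taken from a real dataset), the irrelevant columns and the rows without temperatures might be dropped like this:

```python
import pandas as pd

# Hypothetical city weather data; names and values are illustrative only.
weather_df = pd.DataFrame({
    "temperature": [71.5, float("nan"), 64.0],
    "population": [8_400_000, 650_000, 2_700_000],
    "snowfall": [0.0, 1.2, 0.0],
    "income": [65_000, 58_000, 61_000],
})

# Remove the columns that do not describe the weather.
selected_df = weather_df.drop(columns=["population", "income"])

# Remove the rows that do not record a temperature.
selected_df = selected_df.dropna(subset=["temperature"])
print(selected_df)
```

Passing `subset` to `dropna()` limits the null check to the temperature column, so rows missing only some other value would be kept.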

#### Data Processing

Data processing involves organizing the data by formatting, cleaning, and sampling it. In our dataset on city weather, if the date column has two different formats, mm-dd-yyyy (e.g., 01-23-1980) and month-day-year (e.g., jan-23-1980), we would convert all dates to the same format.
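
The date example can be sketched with the standard library alone. The helper name `normalize_date` and the list of formats are assumptions made for illustration, and the month abbreviation is matched using the default English locale:

```python
from datetime import datetime

# Formats assumed from the example above: 01-23-1980 and jan-23-1980.
KNOWN_FORMATS = ("%m-%d-%Y", "%b-%d-%Y")


def normalize_date(text):
    """Return the date as mm-dd-yyyy, trying each known format in turn."""
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(text, fmt).strftime("%m-%d-%Y")
        except ValueError:
            continue
    raise ValueError(f"Unrecognized date format: {text!r}")


raw_dates = ["01-23-1980", "jan-23-1980"]
clean_dates = [normalize_date(d) for d in raw_dates]
print(clean_dates)
```

With a DataFrame, the same helper could be passed to the `apply` method of the date column.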

#### Data Transformation

Data transformation entails transforming our data into a simpler format for storage and future use, such as a CSV, spreadsheet, or database file. Once our weather data is cleaned and processed, we would export the final version of the data as a CSV file for future analysis.

## 18.2.2
### Pandas Refresher
When it comes to preprocessing data, you have good news for Martha. The Pandas Python library is really good at this! When Martha asks for a quick refresher on how to use Pandas for data munging, you know just the dataset to use—the iris dataset from the University of California, Irvine (UCI) Machine Learning Repository.

Pandas is a Python library that is excellent for data munging. We'll be using the iris dataset from the UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/datasets/iris), a common dataset used throughout machine learning:

   1. Download the raw iris.csv (https://archive.ics.uci.edu/ml/datasets/iris) and store it in an easy-to-access location.

   2. Open a new Jupyter Notebook.

   3. Import your libraries:

        import pandas as pd

   4. To load the dataset in a Pandas DataFrame, enter the code below. Be sure to use the path to the stored CSV file (stored in an easy-to-access location):

        file_path = "<folder path to stored data sets>/iris.csv"
        iris_df = pd.read_csv(file_path)
        iris_df.head()

   5. Select the fields of data you want:

         - The CSV is read into a DataFrame. The resulting DataFrame displays the columns sepal_length, sepal_width, petal_length, petal_width, and class.

    **note:**
    Unsupervised learning will be used to determine the class of the iris plants later on in the module.

   6. Drop the class field using the code below:
        new_iris_df = iris_df.drop(["class"], axis=1)
        new_iris_df.head()

    Drop the class column from the DataFrame and display resulting DataFrame.

**skill drill**

Try reordering the columns so the sepal and petal lengths are the first two columns and the widths are the last two columns.
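
One possible approach to the skill drill, sketched on a tiny stand-in for `new_iris_df` (the two rows of values are illustrative), is to select the columns with a list in the desired order:

```python
import pandas as pd

# Tiny stand-in for new_iris_df; the values are illustrative.
new_iris_df = pd.DataFrame({
    "sepal_length": [5.1, 4.9],
    "sepal_width": [3.5, 3.0],
    "petal_length": [1.4, 1.4],
    "petal_width": [0.2, 0.2],
})

# Selecting with a list of names returns the columns in that order.
new_iris_df = new_iris_df[["sepal_length", "petal_length", "sepal_width", "petal_width"]]
print(new_iris_df.columns.tolist())
```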

With all the remaining data in numerical form and of the same type, cleaning this dataset appears complete, so no data processing is needed. However, you'll encounter data transformations on datasets that contain categorical data or non-numeric features (e.g., transforming male and female categorical values to 0 and 1, respectively).
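
A transformation like the one just mentioned might look as follows. This is a sketch on a hypothetical DataFrame (`people_df` and its columns are made up for illustration) using the `map()` method with an explicit dictionary:

```python
import pandas as pd

# Hypothetical data with one categorical column.
people_df = pd.DataFrame({
    "gender": ["male", "female", "female"],
    "age": [34, 29, 41],
})

# Map each category to the number named in the text: male -> 0, female -> 1.
people_df["gender"] = people_df["gender"].map({"male": 0, "female": 1})
print(people_df.dtypes)
```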

Finally, the preprocessed DataFrame is saved to a new CSV file for future use. This is done by storing the file path in a variable, then using the Pandas to_csv() method to export the DataFrame to a CSV by supplying the file path and file name as arguments, as shown below:

    output_file_path = "<path to folder>/new_iris_data.csv"
    new_iris_df.to_csv(output_file_path, index=False)

In [4]:
# Doing 18.2.2
import pandas as pd

file_path = "../Datasets/iris.csv"
iris_df = pd.read_csv(file_path)
iris_df.head()

new_iris_df = iris_df.drop(['class'], axis = 1)
new_iris_df.head()

output_file_path = "../Exported_Data/new_iris_data.csv"
new_iris_df.to_csv(output_file_path, index=False)

## 18.2.3
### Preprocessing Data With Pandas
Martha is super grateful for the Pandas refresher—it's always fun to work with such a classic dataset! Now that you are both on the same page, you start to think critically about your cryptocurrency dataset.

As mentioned, we don't know the output of the data, but that doesn't mean we shouldn't think about our data or that we should carelessly plug it into a model.

Let's take a look at how we should start our data processing by loading in the shopping_data.csv file.

Read in shopping data and display the DataFrame.

#### Questions for Data Preparation

Unsupervised learning doesn't have a clear outcome or target variable like supervised learning, but it is used to find patterns. By properly preparing the data, we can select features that help us find patterns or groups.

Before we begin, consider these questions:

   - What knowledge do we hope to glean from running an unsupervised learning model on this dataset?
   - What data is available? What type? What is missing? What can be removed?
   - Is the data in a format that can be passed into an unsupervised learning model?
   - Can I quickly hand off this data for others to use?

Let's address the first question on our list:

What knowledge do we hope to glean from running an unsupervised learning model on this dataset?

It's a shopping dataset, so we can group shoppers together based on their spending habits.


In [5]:
# Doing 18.2.3
file_path = "../Datasets/shopping_data.csv"
df_shopping = pd.read_csv(file_path, encoding="ISO-8859-1")
df_shopping.head(5)

Unnamed: 0,CustomerID,Card Member,Age,Annual Income,Spending Score (1-100)
0,1,Yes,19.0,15000,39.0
1,2,Yes,21.0,15000,81.0
2,3,No,20.0,16000,6.0
3,4,No,23.0,16000,77.0
4,5,No,31.0,17000,40.0


## 18.2.4
### Data Selection
It's not every day that you and Martha have a chance to convince an accounting firm to invest in cryptocurrency! So, you want to make sure you know how to select the data that will best help the model determine patterns or grouping.

To help us select the data, let's return to some of the questions on our list.

#### What data is available?

First, account for the data you have. After all, you can't extract knowledge without data. We can use the columns attribute to output the column names, as shown below:

    # Columns
    df_shopping.columns

Looking at the columns, we see there is data for CustomerID, Card Member, Age, Annual Income, and Spending Score:

Displays a list that contains the columns CustomerID, Card Member, Age, Annual Income, and Spending Score (1-100)

Now that we know what data we have, we can start thinking about possible analysis. For example, data points for features like Age and Annual Income might appear in our end result as groupings or clusters. However, there are no data points for items purchased, so our algorithms cannot discover related patterns.

#### What type of data is available?

Using the dtypes attribute, confirm the data type of each column. This also alerts us if anything should be changed in the next step (e.g., converting text to numerical data). All the columns we plan to use in our model must contain a numerical data type:

Use the dtypes attribute to display the data types of the shopping DataFrame.
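
To go one step further than reading the data types by eye, the non-numeric columns can be listed directly. The sketch below uses a small stand-in with the same column names as the shopping data (`sample_df` and its values are illustrative) and the `select_dtypes()` method:

```python
import pandas as pd

# Small stand-in with the shopping data's column names; values are illustrative.
sample_df = pd.DataFrame({
    "CustomerID": [1, 2],
    "Card Member": ["Yes", "No"],
    "Age": [19.0, 21.0],
    "Annual Income": [15000, 15000],
    "Spending Score (1-100)": [39.0, 81.0],
})

# Columns whose dtype is not numeric still need converting or dropping.
non_numeric = sample_df.select_dtypes(exclude="number").columns.tolist()
print(non_numeric)
```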

#### What data is missing?

Next, let's see if any data is missing. Unsupervised learning models can't handle missing data. If you try to run a model on a dataset with missing data, you'll get an error such as the one below:

     ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

If you had hoped to base your analysis on a particular column, but more than 80% of its rows turn out to be empty, then the results won't be very accurate!

For example, return to our Age and Income groups: If it turns out there are 1,200 rows without any Age data points, then we clearly can't use that column in our model. There is no set cutoff for missing data—that decision is left up to you, the analyst, and must be made based on your understanding of the business needs.

**note:**
Handling missing data is a complex topic that is out of scope for this unit. However, if you're interested, read this article (https://towardsdatascience.com/how-to-handle-missing-data-8646b18db0d4) on the possible approaches to handling missing data.

Pandas has the isnull() method to check for missing values. We'll loop through each column, check if there are null values, sum them up, and print out a readable total:

Loop through the columns in the shopping DataFrame, check for null values, and print totals for each column.

There will be a few rows with missing values that we'll need to handle. The judgement call will be to either remove these rows or decide that the dataset is not suitable for our model. In this case, we'll proceed with handling these values because they are a small percentage of the overall data.
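
To judge whether the missing values really are a small percentage, the counts can be turned into percentages. This is a sketch on illustrative data (`sample_df` is a stand-in, not the shopping dataset):

```python
import pandas as pd

# Illustrative stand-in: one of the four Age values is missing.
sample_df = pd.DataFrame({
    "Age": [19.0, None, 20.0, 23.0],
    "Annual Income": [15000, 15000, 16000, 16000],
})

# isnull() gives True/False per cell; the mean of those is the missing fraction.
missing_pct = sample_df.isnull().mean() * 100
print(missing_pct)
```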

**important:**
When deciding to proceed, the percentage of data missing isn't always the only determining factor. See the Note callout above for a resource on handling missing data.

#### What data can be removed?

You have begun to explore the data and have taken a look at null values. Next, determine whether any data can be removed. Consider: Are there string columns that we can't use? Are there columns with excessive null data points? Did we decide to handle missing values by simply removing them?

In our example, the only string column is Card Member, which can be converted to numbers in the data processing step, and only a few rows have null data points, not enough to justify removing a whole column.

Rows of data with null values can be removed with the dropna() method, as shown below:

    # Drop null rows
    df_shopping = df_shopping.dropna()

Duplicates can also be removed.

Use the duplicated().sum() method to check for duplicates, as shown below:

Use the duplicated().sum() method to determine how many duplicates are in the DataFrame. The result prints out zero duplicate entries.

Looks good with no duplicates!
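
Had the check found duplicates, they could be removed with the `drop_duplicates()` method. A minimal sketch on illustrative data containing one repeated row:

```python
import pandas as pd

# Illustrative stand-in: the first two rows are identical.
sample_df = pd.DataFrame({
    "Age": [19.0, 19.0, 20.0],
    "Annual Income": [15000, 15000, 16000],
})
print(f"Duplicate entries: {sample_df.duplicated().sum()}")

# Keep the first occurrence of each row and drop the repeats.
deduped_df = sample_df.drop_duplicates()
print(f"Rows after removal: {len(deduped_df)}")
```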

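The CustomerID column only identifies each row and won't help the model find patterns, so it can be removed. To remove the column, enter the code below: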

Drop the CustomerID column and display the resulting DataFrame.

In [6]:
# Doing 18.2.4

# Columns
df_shopping.columns

# List dataframe data types 
df_shopping.dtypes

# Find null values 
for column in df_shopping.columns: 
    print(f"Column {column} has {df_shopping[column].isnull().sum()} null values")
    
# Drop null rows
df_shopping = df_shopping.dropna()

# Find duplicate entries
print(f"Duplicate entries: {df_shopping.duplicated().sum()}")

# Remove the CustomerID Column 
df_shopping.drop(columns=["CustomerID"], inplace=True)
df_shopping.head()

Column CustomerID has 0 null values
Column Card Member has 2 null values
Column Age has 2 null values
Column Annual Income has 0 null values
Column Spending Score (1-100) has 1 null values
Duplicate entries: 0


Unnamed: 0,Card Member,Age,Annual Income,Spending Score (1-100)
0,Yes,19.0,15000,39.0
1,Yes,21.0,15000,81.0
2,No,20.0,16000,6.0
3,No,23.0,16000,77.0
4,No,31.0,17000,40.0


## 18.2.5
### Data Processing
Now that you know what kind of data you want to work with, it's time to meet the needs of your unsupervised model.

The next step is to move on from what you (the user) want to get out of your data and on to what the unsupervised model needs out of the data.

Recall that in the data selection step, you, as the user, are exploring the data to see what kind of insights and analysis you might glean. You reviewed the columns available and the data types stored, and determined if there were missing values.

For data processing, the focus is on making sure the data is set up for the unsupervised learning model, which requires the following:

   - Null values are handled.
   - Only numerical data is used.
   - Values are scaled. In other words, data has been manipulated to ensure that the variance between the numbers won't skew results.

**rewind**

Recall that when features have different scales, they can have a disproportionate impact on the model. The unscaled value could lead to messy graphs. Therefore, it is important to understand when to scale and normalize data. For example, if four columns of data are single digits, and the fifth column is in the millions, we would need to scale the fifth column to align with the other four.
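
One common way to put columns on a comparable scale is z-score standardization: subtract each column's mean and divide by its standard deviation. The sketch below is only an illustration on made-up data (`sample_df` and its columns are hypothetical) and is not necessarily the scaling approach used later in the module:

```python
import pandas as pd

# Hypothetical data: one single-digit column and one in the millions.
sample_df = pd.DataFrame({
    "rating": [1.0, 5.0, 9.0],
    "revenue": [1_000_000.0, 5_000_000.0, 9_000_000.0],
})

# Subtract each column's mean and divide by its standard deviation.
scaled_df = (sample_df - sample_df.mean()) / sample_df.std()
print(scaled_df)
```

After scaling, both columns cover the same range, so neither dominates because of its units.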

Let's return again to our list of questions.

#### Is the data in a format that can be passed into an unsupervised learning model?

We saw before that all our data had the correct type for each column; however, we know that our model can't have strings passed into it.

To make sure we can use our string data, we'll transform our strings of Yes and No from the Card Member column to 1 and 0, respectively, by creating a function that will convert Yes to a 1 and anything else to 0.

The function will then be run on the whole column with the .apply method, as shown below:

Create a function to change the Yes from the Card Member column to a 1 and anything else to a 0. Then apply the function on the Card Member column of the shopping DataFrame.
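
An alternative sketch, not the approach used in this module, is the `map()` method with an explicit dictionary. Unlike a function that sends everything other than Yes to 0, values missing from the dictionary become NaN, which makes unexpected entries easy to spot. The Series below is illustrative and includes a deliberately unexpected value:

```python
import pandas as pd

# Illustrative values; "maybe" stands in for an unexpected entry.
card_member = pd.Series(["Yes", "No", "Yes", "maybe"])

# Explicit mapping: values missing from the dictionary become NaN.
encoded = card_member.map({"Yes": 1, "No": 0})
print(encoded)
print(f"Unexpected values: {encoded.isnull().sum()}")
```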

Also, there is one more thing you may notice about the data. The scale for Annual Income is much larger than all the other values in the dataset. We can adjust this format by dividing by 1,000 to rescale those data points, as shown below:

Scale the Annual Income column down by dividing by 1,000. Then display the resulting DataFrame.

**skill drill**

Reformat the names of the columns so they contain no spaces or numbers.
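
One possible approach to the skill drill is the `rename()` method with a dictionary of old and new names. The replacement names below are just one choice, and `sample_df` is an illustrative stand-in for the shopping DataFrame:

```python
import pandas as pd

# Illustrative stand-in with the shopping data's column names.
sample_df = pd.DataFrame({
    "Card Member": [1, 0],
    "Age": [19.0, 21.0],
    "Annual Income": [15.0, 15.0],
    "Spending Score (1-100)": [39.0, 81.0],
})

# The new names are one possible choice, not prescribed by the text.
sample_df = sample_df.rename(columns={
    "Card Member": "Card_Member",
    "Annual Income": "Annual_Income",
    "Spending Score (1-100)": "Spending_Score",
})
print(sample_df.columns.tolist())
```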

In [8]:
# Doing 18.2.5
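# Transform string column: "Yes" becomes 1, anything else becomes 0.
# Run this step only once; on a second run the column already holds 1/0,
# so no value equals "Yes" and every row is set to 0.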
def change_string(member):
    if member == "Yes":
        return 1
    else: 
        return 0 

df_shopping["Card Member"] = df_shopping["Card Member"].apply(change_string)
df_shopping.head()

# Transform annual income 
df_shopping["Annual Income"] = df_shopping["Annual Income"]/1000
df_shopping.head()

Unnamed: 0,Card Member,Age,Annual Income,Spending Score (1-100)
0,0,19.0,15.0,39.0
1,0,21.0,15.0,81.0
2,0,20.0,16.0,6.0
3,0,23.0,16.0,77.0
4,0,31.0,17.0,40.0


## 18.2.6
### Data Transformation
You have done all this work to get your data ready to be passed into an unsupervised learning model, but what about when other teams need to use this data? The next step is transforming your data into a format that is convenient for others to use in the future.

Data transformation involves thinking about the future. More often than not, there will be new data coming into your data storage (a place where raw data is stored before being touched), with many people working on different types of data analysis. We want to make sure that whoever wants to use the data in the future can do so.

Let's return once more to our list of questions.

#### Can I quickly hand off this data for others to use?

The data now needs to be transformed back into a more user-friendly format. It would be nice if everyone was as great with DataFrames as you two; unfortunately, that is not the case. You'll want to convert the final product into a common file format such as CSV or Excel.

Now that our data has been cleaned and processed, it is ready to be converted to a readable format for future use:

    # Saving cleaned data
    file_path = "<path to your folder>/shopping_data_cleaned.csv"
    df_shopping.to_csv(file_path, index=False)

**skill drill**

Try to export the data to a different format.
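
As one possible answer to the skill drill, the DataFrame could be exported to JSON with the `to_json()` method. The sketch below uses illustrative data and writes to a temporary folder so that it runs anywhere; in practice you would supply your own output path:

```python
import tempfile
from pathlib import Path

import pandas as pd

# Illustrative stand-in for the cleaned shopping data.
sample_df = pd.DataFrame({"Age": [19.0, 21.0], "Annual Income": [15.0, 15.0]})

# A temporary folder keeps the sketch self-contained; use your own path in practice.
with tempfile.TemporaryDirectory() as folder:
    json_path = Path(folder) / "shopping_data_cleaned.json"
    sample_df.to_json(json_path, orient="records")
    # Read the file back to confirm the export worked.
    round_trip = pd.read_json(json_path)

print(round_trip)
```

The `orient="records"` argument writes one JSON object per row, which many other tools read easily.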

Now you know the questions to ask about your data and understand the Pandas processes used to help answer those questions. Different datasets have different issues. With practice, you'll get better at identifying these.

In [10]:
# Doing 18.2.6
# Saving cleaned data
file_path = "../Exported_Data/shopping_data_cleaned.csv"
df_shopping.to_csv(file_path, index=False)