<hr style="height:.9px;border:none;color:#333;background-color:#333;" />
<hr style="height:.9px;border:none;color:#333;background-color:#333;" />

<br><h2>Script 02 | Principal Component Analysis</h2>
<br>
Written by Chase Kusterer<br>
<a href="https://github.com/chase-kusterer">GitHub</a> | <a href="https://www.linkedin.com/in/kusterer/">LinkedIn</a>
<br><br><br>

<hr style="height:.9px;border:none;color:#333;background-color:#333;" />
<hr style="height:.9px;border:none;color:#333;background-color:#333;" />

<h2>Part I: Introduction and Preparation</h2><br>

<strong>A Note on the Dataset</strong><br>
The dataset in this script represents the annual spending of a subset of the top customers for Apprentice Chef, Inc. The monetary units are unknown, and the demographic information related to each client is as follows:<br><br><br>
<u>Channel</u><br>

1. Online
2. Mobile App

<br>
<u>Region</u><br>

1. Alameda
2. San Francisco
3. Contra Costa

<br><br>
Run the following code to import necessary packages, load data, and set display options. 

In [None]:
########################################
# importing packages
########################################
import numpy             as np                   # mathematical essentials
import pandas            as pd                   # data science essentials
import matplotlib.pyplot as plt                  # fundamental data visualization
import seaborn           as sns                  # enhanced visualization
from sklearn.preprocessing import StandardScaler # standard scaler
from sklearn.decomposition import PCA            # pca


########################################
# loading data and setting display options
########################################
# loading data
customers_df = pd.read_excel(io = './datasets/top_customers.xlsx')


# setting print options
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)
pd.set_option('display.max_colwidth', 100)

<hr style="height:.9px;border:none;color:#333;background-color:#333;" /><br>

<strong>User-Defined Functions</strong><br>
Run the following code to load the user-defined functions used throughout this Notebook.

In [None]:
########################################
# scree_plot
########################################
def scree_plot(pca_object, export = False):
    """
    Visualizes a scree plot from a pca object.
    
    PARAMETERS
    ----------
    pca_object | A fitted pca object
    export     | Set to True if you would like to save the scree plot to the
               | current working directory (default: False)
    """
    # building a scree plot

    # setting plot size
    fig, ax = plt.subplots(figsize=(10, 8))
    features = range(pca_object.n_components_)


    # developing a scree plot
    plt.plot(features,
             pca_object.explained_variance_ratio_,
             linewidth = 2,
             marker = 'o',
             markersize = 10,
             markeredgecolor = 'black',
             markerfacecolor = 'grey')


    # setting more plot options
    plt.title('Scree Plot')
    plt.xlabel('PCA feature')
    plt.ylabel('Explained Variance')
    plt.xticks(features)

    if export:
    
        # exporting the plot
        plt.savefig('./analysis_images/top_customers_correlation_scree_plot.png')
        
    # displaying the plot
    plt.show()


########################################
# scaler
########################################
def scaler(df):
    """
    Standardizes a dataset (mean = 0, variance = 1). Returns a new DataFrame.
    Requires sklearn.preprocessing.StandardScaler()
    
    PARAMETERS
    ----------
    df     | DataFrame to be used for scaling
    """

    # INSTANTIATING a StandardScaler() object
    scaler = StandardScaler(copy = True)


    # FITTING the scaler with the data
    scaler.fit(df)


    # TRANSFORMING our data after fit
    x_scaled = scaler.transform(df)

    
    # converting scaled data into a DataFrame
    new_df = pd.DataFrame(x_scaled)


    # reattaching column names
    new_df.columns = list(df.columns)
    
    return new_df

<hr style="height:.9px;border:none;color:#333;background-color:#333;" /><br>

<h4>a) Write code to check information about non-missing values and data types for each column.</h4>

In [None]:
# checking information about each feature
_____

In [None]:
# checking information about each feature
customers_df.info(verbose = True)

<hr style="height:.9px;border:none;color:#333;background-color:#333;" /><br>

<h4>b) Write code to check descriptive statistics on the numeric features of the dataset.</h4>

In [None]:
# descriptive statistics about each numeric feature
_____

In [None]:
# descriptive statistics about each numeric feature
customers_df.describe(include = 'number').round(decimals = 2)

<hr style="height:.9px;border:none;color:#333;background-color:#333;" />

In [None]:
# value counts for channel and region
print(f"""\
Channel
-------
{customers_df['Channel'].value_counts(normalize=False).to_string(buf=None)}


Region
------
{customers_df['Region'].value_counts(normalize=False).to_string(buf=None)}""")

<hr style="height:.9px;border:none;color:#333;background-color:#333;" /><br>

<strong>Run the following code to generate histograms for each of the features in the dataset.</strong>

In [None]:
# setting figure size
fig, ax = plt.subplots(figsize = (12, 8))
ax.remove()

# initializing a counter
count = 0


# looping to create visualizations
for col in customers_df:

    # condition to break
    if count == 8:
        break
    
    # increasing count
    count += 1
    
    # preparing histograms
    plt.subplot(3, 3, count)
    sns.histplot(x = customers_df[col],)


# formatting and displaying the plot
plt.tight_layout()
plt.show()

<hr style="height:.9px;border:none;color:#333;background-color:#333;" /><br>

<strong>Datasets with Features for Different Purposes</strong><br>
Notice from the outputs above that the dataset contains demographic data (channel and region) as well as purchasing data (spending per category). In unsupervised learning, feature types such as these should not be used together in the same algorithm. Demographic data is very different from purchasing data, and combining them would bias the results of an analysis. Instead, if a problem requires unsupervised learning and demographic data is present in the dataset, a best practice is to remove the demographic data before building an algorithm. Later, demographic data can be used to compare results.<br><br><br>
<strong>PCA and Scaling</strong><br>
As with KNN, explanatory variables should be scaled before developing a principal component analysis algorithm.<br><br><br>
<h4>c) Complete the following in the code below:</h4>

* drop the demographic data (Channel and Region) and store the result as purchases_df
* scale purchases_df with the user-defined scaler( ) function, which instantiates a StandardScaler( ) object, fits it to the data, and transforms the data
* compare the variance of each feature before and after scaling

In [None]:
# removing demographic data
purchases_df = customers_df.drop(['Channel', 'Region'], axis = 1)


# scaling features before correlation analysis
purchases_scaled = scaler(df = _____)


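# checking pre- and post-scaling variance of the purchasing features
# (ddof = 0 matches the population variance used by StandardScaler)
print(purchases_df.var(ddof = 0), '\n\n')
print(purchases_scaled.var(ddof = 0))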

In [None]:
# removing demographic data
purchases_df = customers_df.drop(['Channel', 'Region'], axis = 1)


# scaling features before correlation analysis
purchases_scaled = scaler(df = purchases_df)


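# checking pre- and post-scaling variance of the purchasing features
# (ddof = 0 matches the population variance used by StandardScaler)
print(purchases_df.var(ddof = 0), '\n\n')
print(purchases_scaled.var(ddof = 0))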
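
To make the scaling step less of a black box, the following is a minimal sketch of standardization done by hand with NumPy. The data is synthetic (the values and the three columns are assumptions for illustration, not taken from top_customers.xlsx). It subtracts each column's mean and divides by its population standard deviation (ddof = 0), which is the estimator StandardScaler uses.

```python
import numpy as np

# synthetic spending data on very different scales (hypothetical values)
rng   = np.random.default_rng(seed = 702)
spend = rng.gamma(shape = 2.0,
                  scale = [100.0, 2000.0, 15.0],
                  size  = (200, 3))

# standardizing by hand: subtract the column mean, then divide by the
# population standard deviation (ddof = 0)
spend_scaled = (spend - spend.mean(axis = 0)) / spend.std(axis = 0, ddof = 0)

# checking pre- and post-scaling variance
print("Variance before scaling:", spend.var(axis = 0, ddof = 0).round(2))
print("Variance after scaling :", spend_scaled.var(axis = 0, ddof = 0).round(2))
```

Without this step, the column with the largest raw variance would dominate the principal components simply because of its units.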

<hr style="height:.9px;border:none;color:#333;background-color:#333;" /><br>

<h4>d) Fill in the blanks below to develop a correlation heatmap of the scaled purchasing features.</h4>

In [None]:
# setting plot size
fig, ax = plt.subplots(figsize = (8, 8))


# developing a correlation matrix object
df_corr = _____._____.round(decimals = 2)


# creating a correlation heatmap
sns.heatmap(data   = _____,
            cmap   = 'Blues',
            square = True,
            annot  = True)


# rendering the heatmap
plt.show()

In [None]:
# setting plot size
fig, ax = plt.subplots(figsize = (8, 8))


# developing a correlation matrix object
df_corr = purchases_scaled.corr(method = 'pearson').round(decimals = 2)


# creating a correlation heatmap
sns.heatmap(data   = df_corr,
            cmap   = 'Blues',
            square = True,
            annot  = True)


# rendering the heatmap
plt.show()

<hr style="height:.9px;border:none;color:#333;background-color:#333;" /><br>

Notice that only a few (Pearson) correlations have an absolute value above 0.50. This makes the dataset a good candidate for PCA. As such, we may be able to explain a high degree of variance with a small number of principal components.<br><br>

<h2>Part II: Principal Component Analysis</h2><br>
Principal component analysis is primarily conducted in three situations:<br>

<u>Correlated Explanatory Variables</u><br>
Model building with correlated explanatory variables is a violation of one of the key assumptions of generalized linear models.<br><br>

<u>Dimensionality Reduction</u><br>
This is commonly conducted when a dataset has a large number of explanatory variables (e.g., every unique click a user has made on a website). Techniques like PCA allow features to be transformed into principal components, (potentially) reducing the number of features needed to explain a high degree of variance.<br><br>

<u>Latent Trait Exploration</u><br>
Understanding factors that cannot be measured directly by inferring them from measurable constructs.<br><br><br>
<strong>Determining the Number of Principal Components</strong><br>A common heuristic is to include enough principal components to explain at least 80% of the variance in a dataset.
<br><br>
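
As a rough illustration of the 80% heuristic, the sketch below computes explained variance ratios from the eigenvalues of a correlation matrix and finds the smallest number of components whose cumulative ratio reaches 0.80. The dataset is synthetic and the names (latent, mixing, n_keep) are assumptions made for this example only.

```python
import numpy as np

# synthetic standardized data with correlated columns (hypothetical)
rng    = np.random.default_rng(seed = 702)
latent = rng.normal(size = (300, 2))
mixing = rng.normal(size = (2, 6))
data   = latent @ mixing + 0.5 * rng.normal(size = (300, 6))
data   = (data - data.mean(axis = 0)) / data.std(axis = 0)

# eigenvalues of the correlation matrix, sorted from largest to smallest
eigenvalues = np.linalg.eigvalsh(np.corrcoef(data, rowvar = False))[::-1]

# explained variance ratio per component and its running total
ratios     = eigenvalues / eigenvalues.sum()
cumulative = np.cumsum(ratios)

# smallest number of components whose cumulative ratio reaches 0.80
n_keep = int(np.argmax(cumulative >= 0.80)) + 1

print("Cumulative explained variance:", cumulative.round(3))
print("Components needed for 80%    :", n_keep)
```

The same check can be run on a fitted scikit-learn PCA object by applying np.cumsum to its explained_variance_ratio_ attribute.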

<h4>a) Complete the code below.</h4>
Complete the code to instantiate, fit, and transform a PCA model with no limits to its number of principal components. Make sure to use the scaled dataset for this task.

In [None]:
# INSTANTIATING a PCA object with no limit to principal components
pca = _____(n_components = None,
            random_state = 702)


# FITTING and TRANSFORMING the scaled data
customer_pca = _____._____(_____)


# comparing dimensions of each DataFrame
print("Original shape:", purchases_scaled.shape)
print("PCA shape     :", customer_pca.shape)

In [None]:
# INSTANTIATING a PCA object with no limit to principal components
pca = PCA(n_components = None,
          random_state = 702)


# FITTING and TRANSFORMING the scaled data
customer_pca = pca.fit_transform(purchases_scaled)


# comparing dimensions of each DataFrame
print("Original shape:", purchases_scaled.shape)
print("PCA shape     :", customer_pca.shape)

<hr style="height:.9px;border:none;color:#333;background-color:#333;" /><br>

<h2>Part III: Evaluating PCA Algorithms</h2><br>
As can be observed from above, the shape of the data did not change. However, the original DataFrame contains features, whereas the new DataFrame contains principal components. Before analyzing the factor loadings of each principal component, it is important to check each component's explained variance ratio. Also note that the explained variance ratios of all principal components should sum to 1.0.<br><br><br>
<h4>a) Complete the loop to print out each explained variance ratio.</h4>
Write code to loop over each principal component, printing its component number as well as its explained variance ratio.

In [None]:
# component number counter
component_number = 0

# looping over each principal component
_____ variance _____ pca._____:
    component_number += _____
    
    print(f"PC {_____}: {_____.round(3)}")

In [None]:
# component number counter
component_number = 0


# looping over each principal component
for variance in pca.explained_variance_ratio_:
    component_number += 1
    print(f"PC {component_number}: {variance.round(3)}")

<br>

In [None]:
# printing the sum of all explained variance ratios
print(pca.explained_variance_ratio_.sum(axis = 0))

<hr style="height:.9px;border:none;color:#333;background-color:#333;" /><br>

<strong>Scree Plots</strong><br>
One useful tool to visualize the explained variance of each principal component is the scree plot. Our goal in analyzing this plot is to look for a point where there is a drop in the marginal return of explained variance. In other words, we are looking for an "elbow" in the plot, where the line connecting each principal component becomes less steep.<br><br>
<h4>b) Call the scree_plot function on the PCA object.</h4>

In [None]:
# calling the scree_plot function
_____

In [None]:
# calling the scree_plot function
scree_plot(pca_object = pca)

<hr style="height:.9px;border:none;color:#333;background-color:#333;" /><br>

<h2>Part IV: Interpreting Principal Components and Latent Traits</h2><br>
Principal components are essentially "bundles" of various parts of the explanatory variables that were used when building an algorithm. Note that each principal component is not directly measurable, but can be measured indirectly by analyzing its <strong>factor loadings</strong>. In other words, we can interpret the meaning of each principal component by looking into which features are strongly correlated with it.<br><br>
Run the following code and analyze the resulting correlation map between the original features and the principal components.

In [None]:
# setting plot size
fig, ax = plt.subplots(figsize = (12, 12))


# developing a PC to feature heatmap
sns.heatmap(pca.components_, 
            cmap = 'coolwarm',
            square = True,
            annot = True,
            linewidths = 0.1,
            linecolor = 'black')


# setting more plot options
plt.yticks([0, 1, 2, 3, 4, 5],
           ["PC 1", "PC 2", "PC 3", "PC 4", "PC 5", "PC 6"])

plt.xticks(range(0, 6),
           purchases_scaled.columns,
           rotation=60,
           ha='left')

plt.xlabel(xlabel = "Feature")
plt.ylabel(ylabel = "Principal Component")


# displaying the plot
plt.show()
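
One detail worth keeping in mind when reading the heatmap: the rows of pca.components_ are unit-length weight vectors (eigenvectors). On standardized data, multiplying each weight by the square root of its component's eigenvalue gives the correlation between a feature and a principal component. The sketch below checks this relationship with NumPy on synthetic data. All names and values are assumptions for illustration.

```python
import numpy as np

# synthetic standardized data (hypothetical)
rng = np.random.default_rng(seed = 702)
raw = rng.normal(size = (500, 4)) @ rng.normal(size = (4, 4))
X   = (raw - raw.mean(axis = 0)) / raw.std(axis = 0)

# eigendecomposition of the correlation matrix
corr_matrix = (X.T @ X) / X.shape[0]
eigenvalues, eigenvectors = np.linalg.eigh(corr_matrix)

# sorting so that PC 1 has the largest eigenvalue
order        = np.argsort(eigenvalues)[::-1]
eigenvalues  = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

# component scores for each observation
scores = X @ eigenvectors

# correlation-scaled loadings: eigenvector weight * sqrt(eigenvalue)
loadings = eigenvectors * np.sqrt(eigenvalues)

# direct correlation between each feature (rows) and each PC (columns)
n_features      = X.shape[1]
feature_pc_corr = np.corrcoef(X, scores, rowvar = False)[:n_features, n_features:]

# checking that both approaches agree
print(np.allclose(loadings, feature_pc_corr))
```

Either version can support naming a component, since scaling by a positive constant does not change which features carry the largest weights within a component.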

<hr style="height:.9px;border:none;color:#333;background-color:#333;" /><br>

Each observation in the dataset is a customer of Apprentice Chef, Inc. Therefore, each principal component can be thought of as a behavioral scale. Naming scales is subjective and often benefits from working with others.<br><br><br>
<h4>a) Analyze the PC factor loadings.</h4>
Run the following code. With your team, analyze the factor loadings and develop a scale for each principal component. When finished, rename the columns of the table with your team's scale names.

In [None]:
# transposing pca components
factor_loadings_df = pd.DataFrame(np.transpose(pca.components_.round(decimals = 2)))


# naming rows as original features
factor_loadings_df = factor_loadings_df.set_index(purchases_scaled.columns)


# checking the result
print(factor_loadings_df)


# saving to Excel (keeping the index so feature names are preserved)
factor_loadings_df.to_excel(excel_writer = 'customer_factor_loadings.xlsx',
                            index        = True)

<hr style="height:.9px;border:none;color:#333;background-color:#333;" />

In [None]:
# naming each principal component
factor_loadings_df._____ = _____


# checking the result
factor_loadings_df

In [None]:
# naming each principal component
factor_loadings_df.columns = ['Protein Over Vitamins', # - Vegan, - Veg, - Indian
                              'Sleeptime Bliss',       # - Med, - ME, - Wine
                              'Palate Practicality',   # + Med, - Wine
                              '3',                     # after elbow
                              '4',                     # after elbow
                              '5']                     # after elbow


# checking the result
factor_loadings_df

<hr style="height:.9px;border:none;color:#333;background-color:#333;" /><br>

<strong>Customer-Level Personas</strong><br>
Earlier in this script we instantiated, fit, and transformed the dataset's original features into principal components:<br><br>

~~~
# FITTING and TRANSFORMING the scaled data
customer_pca = pca.fit_transform(purchases_scaled)
~~~

<br>
Now that we have developed personas, we can analyze how strongly each customer fits into each group. Run the following code to view each customer's score on each persona.

In [None]:
# converting into a DataFrame 
customer_pca = pd.DataFrame(customer_pca)


# renaming columns
customer_pca.columns = factor_loadings_df.columns


# checking results
customer_pca

<hr style="height:.9px;border:none;color:#333;background-color:#333;" /><br>

Digging deeper into the DataFrame above can unearth key findings and market opportunities. <strong>This is something expected of you on your final project.</strong> As an example, if we were exploring the market potential for customers who score above 1.0 on the Protein Over Vitamins persona, we could do so through subsetting, as in the following code. Try this on other personas and enjoy the exploration :)

In [None]:
# exploring customers in the Protein Over Vitamins persona
len(customer_pca['Protein Over Vitamins'][customer_pca['Protein Over Vitamins'] > 1.0])
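
The subsetting above can be extended to every persona at once. The following sketch uses a synthetic score table (the persona names and values are placeholders, not results from this dataset) and counts how many customers exceed a threshold on each persona.

```python
import numpy  as np
import pandas as pd

# synthetic component scores (hypothetical values and persona names)
rng       = np.random.default_rng(seed = 702)
scores_df = pd.DataFrame(data    = rng.normal(size = (100, 3)),
                         columns = ['Persona A', 'Persona B', 'Persona C'])

# counting customers above the threshold for every persona at once
threshold = 1.0
counts    = (scores_df > threshold).sum(axis = 0)

# share of customers above the threshold
shares = (scores_df > threshold).mean(axis = 0).round(decimals = 2)

# checking the results
print(pd.DataFrame({'count': counts, 'share': shares}))
```

The threshold of 1.0 is a judgment call. It may be worth trying several cutoffs, or a negative cutoff for personas defined by negative loadings.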

<hr style="height:.9px;border:none;color:#333;background-color:#333;" /><br>

<h2>Part V: Reducing to Relevant Principal Components</h2><br>
In this example, we will assume three PCs is a reasonable number based on the elbow in the scree plot. Also note that it would have been reasonable to retain enough PCs so that the cumulative explained variance ratio is greater than or equal to 0.80. Note that we do not need to rerun the scree plot after completing this step.
<br>
<h4>a) Complete the code to develop a new PCA model.</h4>

In [None]:
# INSTANTIATING a new model using the first three principal components
pca_3 = _____(_____,
            _____ = 702)


# FITTING and TRANSFORMING the purchases_scaled
customer_pca_3 = _____._____(_____)


# calling the scree_plot function
_____

In [None]:
# INSTANTIATING a new model using the first three principal components
pca_3 = PCA(n_components = 3,
            random_state = 702)


# FITTING and TRANSFORMING the purchases_scaled
customer_pca_3 = pca_3.fit_transform(purchases_scaled)


# calling the scree_plot function
scree_plot(pca_object = pca_3,
           export     = False)

<hr style="height:.9px;border:none;color:#333;background-color:#333;" /><br>

<strong>OPTIONAL STEP</strong><br>Run the following code to compare the factor loadings of the unlimited PCA model with those of the reduced PCA model. We are doing this simply to show that the factor loadings of the retained principal components do not change after dropping smaller PCs.

In [None]:
####################
### Max PC Model ###
####################
# transposing pca components (pc = MAX)
factor_loadings = pd.DataFrame(np.transpose(pca.components_))


# naming rows as original features
factor_loadings = factor_loadings.set_index(purchases_scaled.columns)


##################
### 3 PC Model ###
##################
# transposing pca components (pc = 3)
factor_loadings_3 = pd.DataFrame(np.transpose(pca_3.components_))


# naming rows as original features
factor_loadings_3 = factor_loadings_3.set_index(purchases_scaled.columns)


# checking the results
print(f"""
MAX Components Factor Loadings
------------------------------
{factor_loadings.round(decimals = 2)}


3 Components Factor Loadings
------------------------------
{factor_loadings_3.round(decimals = 2)}
""")

<hr style="height:.9px;border:none;color:#333;background-color:#333;" /><br>

<h4>b) Analyze and name each principal component based on its factor loading.</h4>

In this step, make sure to develop a story behind what each PC name represents. This is an ideal method for bridging the gap between the technical and non-technical people you are working with. Remember, by doing a good job here you are putting analytics at the forefront of strategic decision making, which is a great way to boost your value within an organization.

In [None]:
# naming each principal component
factor_loadings_3._____ = ['Protein Over Vitamins', # - Vegan, - Veg, - Indian
                           'Sleeptime Bliss',       # - Med, - ME, - Wine
                           'Palate Practicality',]  # Med, no Wine


# checking the result
_____

In [None]:
# naming each principal component
factor_loadings_3.columns = ['Protein Over Vitamins', # - Vegan, - Veg, - Indian
                             'Sleeptime Bliss',       # - Med, - ME, - Wine
                             'Palate Practicality',]  # Med, no Wine


# checking the result
factor_loadings_3.round(decimals = 2)

<hr style="height:.9px;border:none;color:#333;background-color:#333;" />

In [None]:
# converting customer-level data into DataFrame
customer_pca_3 = pd.DataFrame(customer_pca_3)


# renaming customer-level data
customer_pca_3.columns = list(factor_loadings_3.columns)


# checking component scores per customer
customer_pca_3.head(n = 5)

<hr style="height:.9px;border:none;color:#333;background-color:#333;" />
<hr style="height:.9px;border:none;color:#333;background-color:#333;" />

~~~


 ,--.-,,-,--,             .-._            _,---.                                        
/==/  /|=|  |.--.-. .-.-./==/ \  .-._ _.='.'-,  \  .-.,.---.  ,--.-.  .-,--.            
|==|_ ||=|, /==/ -|/=/  ||==|, \/ /, /==.'-     / /==/  `   \/==/- / /=/_ /             
|==| ,|/=| _|==| ,||=| -||==|-  \|  /==/ -   .-' |==|-, .=., \==\, \/=/. /              
|==|- `-' _ |==|- | =/  ||==| ,  | -|==|_   /_,-.|==|   '='  /\==\  \/ -/               
|==|  _     |==|,  \/ - ||==| -   _ |==|  , \_.' )==|- ,   .'  |==|  ,_/                
|==|   .-. ,\==|-   ,   /|==|  /\ , \==\-  ,    (|==|_  . ,'.  \==\-, /                 
/==/, //=/  /==/ , _  .' /==/, | |- |/==/ _  ,  //==/  /\ ,  ) /==/._/                  
`--`-' `-`--`--`..---'   `--`./  `--``--`------' `--`-`--`--'  `--`-`                   
     _,---.     _,.---._                                                                
  .-`.' ,  \  ,-.' , -  `.   .-.,.---.                                                  
 /==/_  _.-' /==/_,  ,  - \ /==/  `   \                                                 
/==/-  '..-.|==|   .=.     |==|-, .=., |                                                
|==|_ ,    /|==|_ : ;=:  - |==|   '='  /                                                
|==|   .--' |==| , '='     |==|- ,   .'                                                 
|==|-  |     \==\ -    ,_ /|==|_  . ,'.                                                 
/==/   \      '.='. -   .' /==/  /\ ,  )                                                
`--`---'        `--`--''   `--`-`--`--'                                                 
   ,-,--.                _,.----.    _,.----.       ,----.    ,-,--.    ,-,--.   .=-.-. 
 ,-.'-  _\ .--.-. .-.-..' .' -   \ .' .' -   \   ,-.--` , \ ,-.'-  _\ ,-.'-  _\ /==/_ / 
/==/_ ,_.'/==/ -|/=/  /==/  ,  ,-'/==/  ,  ,-'  |==|-  _.-`/==/_ ,_.'/==/_ ,_.'|==|, |  
\==\  \   |==| ,||=| -|==|-   |  .|==|-   |  .  |==|   `.-.\==\  \   \==\  \   |==|  |  
 \==\ -\  |==|- | =/  |==|_   `-' \==|_   `-' \/==/_ ,    / \==\ -\   \==\ -\  /==/. /  
 _\==\ ,\ |==|,  \/ - |==|   _  , |==|   _  , ||==|    .-'  _\==\ ,\  _\==\ ,\ `--`-`   
/==/\/ _ ||==|-   ,   |==\.       |==\.       /|==|_  ,`-._/==/\/ _ |/==/\/ _ | .=.     
\==\ - , //==/ , _  .' `-.`.___.-' `-.`.___.-' /==/ ,     /\==\ - , /\==\ - , /:=; :    
 `--`---' `--`..---'                           `--`-----``  `--`---'  `--`---'  `=`                                                                      



~~~

<hr style="height:.9px;border:none;color:#333;background-color:#333;" />
<hr style="height:.9px;border:none;color:#333;background-color:#333;" />

<br>