## D209 Data Mining 1 PA
##### Submitted By Edwin Perry
### Table of Contents
<ol>
    <li><a href="#A">Research Question</a></li>
    <li><a href="#B">Technique Justification</a></li>
    <li><a href="#C">Data Preparation</a></li>
    <li><a href="#D">Analysis</a></li>
    <li><a href="#E">Data Summary and Implications</a></li>
    <li><a href="#F">Panopto Video</a></li>
</ol>


<h4 id="A">Research Question</h4>
<h5>Question</h5>
<p>The question I have chosen to answer is "Are there meaningful groups with distinct, identifiable preferences in the telecommunications industry?" I will use hierarchical clustering to attempt to form these groups, helping the business segment its customers based on important criteria.</p>
<h5>Goal</h5>
<p>The goal of this analysis is to determine whether meaningful groups/clusters exist within the scores that customers provide in their survey responses. This could help the business identify customer priorities and improve its service offerings going forward.</p>
<h4 id="B">Technique Justification</h4>
<h5>Clustering Technique Explanation</h5>
<p>Hierarchical clustering is the technique that will be used to analyze the data. This method takes the observations in the raw data and generates multi-level clusters, in which larger clusters contain smaller clusters. The most closely related observations are grouped first. These initial clusters are then merged with their nearest neighbors into intermediate clusters, and the intermediate clusters are merged in the same manner until all of the data is contained within one overall cluster. This is commonly explained by analogy to taxonomy, where similar animals form a species, multiple species form a genus, multiple genera form a family, and so on until every living thing is grouped together under the "Living" category.
The expected outcome is a hierarchy whose labelled clusters let us examine the distribution of survey scores and identify patterns in consumer sentiment.</p>
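<p>The bottom-up merging described above can be sketched on a small toy dataset. This is only an illustration: the six points are hypothetical, and Ward linkage is an assumption made for the example, not necessarily the linkage method used later in this analysis.</p>

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Hypothetical toy data: six observations with two features each
toy = np.array([
    [1.0, 1.0],
    [1.2, 0.9],
    [5.0, 5.0],
    [5.1, 4.8],
    [9.0, 1.0],
    [9.2, 1.1],
])

# Ward linkage (an assumption for this sketch) merges, at each step, the
# pair of clusters whose union least increases within-cluster variance
merges = linkage(toy, method='ward')

# Each row records one merge: the two cluster ids that were joined,
# the distance between them, and the size of the newly formed cluster
for left, right, distance, size in merges:
    print(int(left), int(right), round(float(distance), 3), int(size))

# n observations always produce n - 1 merges, and the final merge
# contains every observation, i.e. the single overall cluster
print(merges.shape)
print(int(merges[-1, 3]))
```

<p>The closest pairs are joined first, and the last row of the merge table corresponds to the single cluster at the top of the hierarchy.</p>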
<h5>Assumptions</h5>
<p>Hierarchical clustering groups the data based on the distance between observations. This means that the method assumes the data is appropriately scaled. Failing to scale the data gives variables with larger ranges an outsized influence on cluster formation, and variables with smaller ranges a diminished influence.</p>
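<p>A minimal sketch of why scaling matters, using hypothetical values: a 1-8 survey score paired with an income in dollars. The z-score standardization shown here is one common choice of scaling and is an assumption for this example.</p>

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import zscore

# Hypothetical data: column 0 is a 1-8 survey score, column 1 is income
raw = np.array([
    [1.0, 50000.0],
    [8.0, 50500.0],
    [1.0, 90000.0],
])

# pdist returns pairwise Euclidean distances in the order
# (row 0, row 1), (row 0, row 2), (row 1, row 2)
raw_dist = pdist(raw)
print(raw_dist)

# Standardize each column to mean 0 and standard deviation 1
scaled = zscore(raw, axis=0)
scaled_dist = pdist(scaled)
print(scaled_dist)
```

<p>Before scaling, rows 0 and 1 look nearly identical even though their survey scores sit at opposite ends of the scale, because the income column dominates the distance. After scaling, the difference in survey score carries roughly as much weight as the difference in income.</p>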
<h5>Chosen Tools</h5>
<p>Python is the language I elected to use for this analysis, for a number of reasons. First, Jupyter Notebook makes it simple to combine Python code with explanations in one file. Furthermore, Python's simple syntax makes it easy and intuitive to develop, troubleshoot, and understand the methodology used to cluster the data. Finally, there are a number of packages and libraries that are specifically designed for this type of analysis. The packages and libraries that I intend to use are as follows: 
<ul>
<li>Pandas for ingesting and handling the data in dataframes</li>
<li>Seaborn and Matplotlib for data visualization</li>
<li>NumPy for mathematical operations</li>
<li>SciPy for performing hierarchical clustering and plotting the resulting dendrogram</li>
<li>scikit-learn for evaluating cluster quality with the silhouette score</li>
</ul></p>
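<p>As a rough sketch of how the SciPy and scikit-learn pieces listed above fit together, the example below cuts a hierarchy into flat clusters and scores the result. The toy data, the Ward linkage, and the choice of two clusters are all assumptions made for illustration.</p>

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.metrics import silhouette_score

# Hypothetical toy data: two well-separated groups of 20 points each
rng = np.random.default_rng(0)
group_a = rng.normal(loc=0.0, scale=0.5, size=(20, 2))
group_b = rng.normal(loc=10.0, scale=0.5, size=(20, 2))
toy = np.vstack([group_a, group_b])

# Build the hierarchy, then cut it so that at most two flat clusters remain
merges = linkage(toy, method='ward')
labels = fcluster(merges, t=2, criterion='maxclust')

# The silhouette score ranges from -1 to 1; values near 1 indicate
# compact clusters that are well separated from one another
score = silhouette_score(toy, labels)
print(round(score, 2))
```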

In [4]:
import pandas as pd
from pandas.api.types import CategoricalDtype
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.cluster.hierarchy import linkage, fcluster, dendrogram
from sklearn.metrics import silhouette_score

df = pd.read_csv('./Data Source/churn_clean.csv', index_col=0)
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 10000 entries, 1 to 10000
Data columns (total 49 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   Customer_id           10000 non-null  object 
 1   Interaction           10000 non-null  object 
 2   UID                   10000 non-null  object 
 3   City                  10000 non-null  object 
 4   State                 10000 non-null  object 
 5   County                10000 non-null  object 
 6   Zip                   10000 non-null  int64  
 7   Lat                   10000 non-null  float64
 8   Lng                   10000 non-null  float64
 9   Population            10000 non-null  int64  
 10  Area                  10000 non-null  object 
 11  TimeZone              10000 non-null  object 
 12  Job                   10000 non-null  object 
 13  Children              10000 non-null  int64  
 14  Age                   10000 non-null  int64  
 15  Income                10

<h4 id="C">Data Preparation</h4>
<h5>Example of Preprocessing Goals</h5>
<p>Overall, we have a number of preprocessing goals to meet before we can use the data for this analysis. One such goal is to remove columns that are irrelevant to the analysis. This is important because reducing the amount of data included in the analysis will improve the code's performance and the overall efficiency of the analysis.</p>
<h5>Initial Dataset Variables</h5>
<p>Only a handful of variables are relevant to this analysis:
<ul>
<li>Item1 (Ordinal Categorical): The importance of timely responses, with values ranging from 1 to 8 and higher numbers indicating lower importance</li>
<li>Item2 (Ordinal Categorical): The importance of timely fixes to issues, with values ranging from 1 to 8 and higher numbers indicating lower importance</li>
<li>Item3 (Ordinal Categorical): The importance of timely replacement of devices, with values ranging from 1 to 8 and higher numbers indicating lower importance</li>
<li>Item4 (Ordinal Categorical): The importance of technological reliability, with values ranging from 1 to 8 and higher numbers indicating lower importance</li>
<li>Item5 (Ordinal Categorical): The importance of variety of options, with values ranging from 1 to 8 and higher numbers indicating lower importance</li>
<li>Item6 (Ordinal Categorical): The importance of respectful responses, with values ranging from 1 to 8 and higher numbers indicating lower importance</li>
<li>Item7 (Ordinal Categorical): The importance of courteous exchanges and discussions, with values ranging from 1 to 8 and higher numbers indicating lower importance</li>
<li>Item8 (Ordinal Categorical): The importance of observable active listening, with values ranging from 1 to 8 and higher numbers indicating lower importance</li>
</ul>
It is important to note that numerical data is not necessarily quantitative or continuous. All of the above variables are stored as integers and indicate a priority level. These scores are meant to rank the importance of the factors; they do not imply that, for example, an item with a score of 6 is half as important as an item with a score of 3. Because the scores establish an order of importance rather than a measured quantity, these variables are considered ordinal and categorical, not quantitative and continuous.</p>
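<p>One way to make the ordinal nature of these scores explicit in pandas is an ordered categorical type. The sketch below uses hypothetical responses and is illustrative only; the analysis itself keeps the columns as integers.</p>

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# Ordered categorical type covering the 1-8 survey scale
survey_scale = CategoricalDtype(categories=list(range(1, 9)), ordered=True)

# Hypothetical responses for a single survey item
responses = pd.Series([3, 6, 1, 3], name='Item1').astype(survey_scale)

# Order-based operations are meaningful for ordinal data
below_four = (responses < 4).tolist()
print(below_four)
print(responses.min(), responses.max())
```

<p>Comparisons and minimum/maximum respect the declared order, which matches the interpretation of these scores as ranks and not as measured quantities.</p>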
<h5>Data Preparation</h5>
<p>The steps used to prepare the data are shown below. The first step is to drop the columns that are not relevant to the analysis.</p>

In [5]:
df = df[['Item1', 'Item2', 'Item3', 'Item4', 'Item5', 'Item6', 'Item7', 'Item8']]
df.head()

CaseOrder,Item1,Item2,Item3,Item4,Item5,Item6,Item7,Item8
1,5,5,5,3,4,4,3,4
2,3,4,3,3,4,3,4,4
3,4,4,2,4,4,3,3,3
4,4,4,4,2,5,4,3,3
5,4,4,4,3,4,4,4,5


<p>The next step in the analysis is to ensure there are no invalid entries. This means checking that the remaining columns contain no null values and no values outside the acceptable 1-8 range.</p>
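<p>Before filtering, it can be useful to count how many invalid entries exist in each column. The following is a minimal sketch on a hypothetical two-column frame (the <code>sample</code> dataframe and its values are invented for illustration), counting nulls and out-of-range values separately.</p>

```python
import numpy as np
import pandas as pd

# Hypothetical survey frame with one missing and one out-of-range value
sample = pd.DataFrame({
    'Item1': [5, 3, np.nan, 4],
    'Item2': [5, 9, 4, 4],
})

# Count missing values per column
null_counts = sample.isna().sum()

# Count values outside the 1-8 range per column; comparisons against
# NaN evaluate to False, so missing values are not double-counted here
out_of_range = ((sample < 1) | (sample > 8)).sum()

print(null_counts)
print(out_of_range)
```

<p>If both counts are zero for every column, the filtering step leaves the data unchanged.</p>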

In [6]:
# Mask every value outside the 1-8 range (inclusive) as NaN, then drop any row
# containing a NaN, whether it was originally missing or masked as out of range
df = df[(df >= 1) & (df <= 8)].dropna()
df.head()

CaseOrder,Item1,Item2,Item3,Item4,Item5,Item6,Item7,Item8
1,5,5,5,3,4,4,3,4
2,3,4,3,3,4,3,4,4
3,4,4,2,4,4,3,3,3
4,4,4,4,2,5,4,3,3
5,4,4,4,3,4,4,4,5


<h5>Copy of Cleaned Data</h5>
<p>A copy of this cleaned and prepared data will be exported at this point, and will be attached to this submission:</p>

In [7]:
df.to_csv('./D212P1CleanedData.csv')

<h4 id="D">Analysis</h4>
<h5>Cluster Number</h5>