# PCA - Principal Component Analysis

# Wine Dataset
## Wines are categorized into 3 customer segments based on the features listed below:
### Features are:

<ol>
    <li>Alcohol</li>
    <li>Malic acid</li>
    <li>Ash</li>
    <li>Alcalinity of ash</li>
    <li>Magnesium</li>
    <li>Total phenols</li>
    <li>Flavanoids</li>
    <li>Nonflavanoid phenols</li>
    <li>Proanthocyanins</li>
    <li>Color intensity</li>
    <li>Hue</li>
    <li>OD280/OD315 of diluted wines</li>
    <li>Proline </li>
    </ol>

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
import seaborn as sns
import pandas as pd

import warnings
warnings.filterwarnings('ignore')

In [None]:
# Set up the environment for using pyspark
import findspark
findspark.init()

In [None]:
from pyspark.sql import SparkSession
from pyspark import SparkConf, SparkContext
from pyspark.ml.linalg import Vectors

In [None]:
# Create Application Context
spark = SparkSession.builder.appName("PCA Wine Dataset").getOrCreate()
sc = spark.sparkContext
sc.setLogLevel("ERROR")

## Data Exploration
1. Create dataframe from the Wine.csv file

In [None]:
sdf = spark.read.format('csv').options(header='true', inferSchema='true').load('../datasets/Wine.csv')

In [None]:
df = sdf.toPandas()
df.head()

In [None]:
df.describe()

In [None]:
df.isnull().sum()

In [None]:
df.info()

In [None]:
df.shape

In [None]:
df['Customer_Segment'].unique()

In [None]:
corr = df.corr()
corr

In [None]:
sdf.printSchema()

<font color = 'tomato'>
<h2>Data preparation</h2>
1. Create features using Vector Assembler<br>
2. Standardize the data<br>
</font>

<font color = 'tomato'>
<h2>Apply Principal Component Analysis (PCA)</h2>
    <ol>
        <li>Create a PCA instance with the number of components set to 2, using stdFeatures as the input column</li>
        <li>Fit the instance to the scaled data</li>
        <li>Transform the scaled data with the fitted model</li>
        <li>Once the full processing is completed, change the number of components to 4 and compare the results</li>
    </ol>
            
</font>

In [None]:
from pyspark.ml.feature import PCA

<font color = 'tomato'>
    <h2>Training and Test set </h2>
    <ol>
        <li>Split the transformed data into training and test sets</li>
    </ol>
</font>

<font color = 'tomato'>
    <h2>Use Logistic Regression </h2>
    <ol>
        <li>Create Logistic Regression instance</li>
        <li>Fit the model on the training set of transformed features</li>
        <li>Use the fitted model to transform the test set and generate predictions</li>
        <li>Evaluate the model using multi-class classification evaluator</li>
    </ol>
</font>

In [None]:
from pyspark.ml.classification import LogisticRegression

In [None]:
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

<font color = 'tomato'>
    <h2>Confusion Matrix </h2>
    <ol>
        <li>Convert the predicted values to a pandas DataFrame</li>
        <li>Extract the actual Customer_Segment values from the test set</li>
        <li>Create Confusion Matrix</li>
    </ol>
</font>
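A confusion matrix can be built from the predictions with pandas alone, as in the rough sketch below. The `results` frame holds hypothetical values standing in for what something like `predictions.select("Customer_Segment", "prediction").toPandas()` would return on the real data.

```python
import pandas as pd

# Hypothetical actual and predicted labels (Spark predictions are floats)
results = pd.DataFrame({
    "Customer_Segment": [1, 1, 2, 2, 3, 3, 2, 1],
    "prediction": [1.0, 1.0, 2.0, 3.0, 3.0, 3.0, 2.0, 2.0],
})

# Steps 1-2: predicted values and actual test values as pandas Series
y_pred = results["prediction"].astype(int)
y_test = results["Customer_Segment"]

# Step 3: cross-tabulate actual (rows) against predicted (columns)
cm = pd.crosstab(y_test, y_pred, rownames=["Actual"], colnames=["Predicted"])
print(cm)
```

Correct predictions sit on the diagonal. Since the notebook already imports seaborn, `sns.heatmap(cm, annot=True)` is one way to render the matrix as a plot.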