## Course challenge

***


## Project Description

As part of the data science team at Gourmet Analytics, you use data analytics to advise companies in the food industry. You clean, organize, and visualize data to arrive at insights that will benefit your clients. As a member of a collaborative team, sharing your analysis with others is an important part of your job. 

Your current client is Chocolate and Tea, an up-and-coming chain of cafes. 

The eatery combines an extensive menu of fine teas with chocolate bars from around the world. Their diverse selection includes everything from plantain milk chocolate, to tangerine white chocolate, to dark chocolate with pistachio and fig. The encyclopedic list of chocolate bars is the basis of Chocolate and Tea’s brand appeal. Chocolate bar sales are the main driver of revenue. 

Chocolate and Tea aims to serve chocolate bars that are highly rated by professional critics. They also continually adjust the menu to make sure it reflects the global diversity of chocolate production. The management team regularly updates the chocolate bar list in order to align with the latest ratings and to ensure that the list contains bars from a variety of countries. 

They’ve asked you to collect and analyze data on the latest chocolate ratings. In particular, they’d like to know which countries produce the highest-rated bars of super dark chocolate (at least 80% cocoa). This data will help them create their next chocolate bar menu. 

Your team has received a dataset that features the latest ratings for thousands of chocolates from around the world. Click here to access the dataset. Given the data and the nature of the work you will do for your client, your team agrees to use R for this project. 

## Data Dictionary

| Field | Description |
|-------|-------------|
| Company (Maker-if known) | Name of the company manufacturing the bar. |
| Specific Bean Origin or Bar Name | The specific geo-region of origin for the bar. |
| REF | A value linked to when the review was entered in the database. Higher = more recent. |
| ReviewDate | Date of publication of the review. |
| CocoaPercent | Cocoa percentage (darkness) of the chocolate bar being reviewed. |
| CompanyLocation | Manufacturer base country. |
| Rating | Expert rating for the bar. |
| BeanType | The variety (breed) of bean used, if provided. |
| Broad BeanOrigin | The broad geo-region of origin for the bean. |
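
The code cells later in this notebook refer to shorter column names such as `Company`, `SpecificBeanOrigin`, and `BroadBeanOrigin`. The sketch below shows one way such a rename could be done with pandas. It assumes the raw headers match the field names in the table above, which may not hold for the actual file; the single row of data is made up.

```python
import pandas as pd

# Assumed mapping from the data dictionary's field names to the short names used in code.
RENAME_MAP = {
    "Company (Maker-if known)": "Company",
    "Specific Bean Origin or Bar Name": "SpecificBeanOrigin",
    "Broad BeanOrigin": "BroadBeanOrigin",
}

# One made-up row, only to show the effect of the rename.
raw = pd.DataFrame({
    "Company (Maker-if known)": ["Example Maker"],
    "Specific Bean Origin or Bar Name": ["Example Origin"],
    "Broad BeanOrigin": ["Example Region"],
    "Rating": [3.75],
})

# Columns not listed in the mapping (here, Rating) are left unchanged.
renamed = raw.rename(columns=RENAME_MAP)
print(list(renamed.columns))
```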

## Business Task

Analyze chocolate bar ratings to identify the highest-rated bars and recommend which ones to sell. The analysis uses the Chocolate Bar Ratings dataset, which contains 1700 chocolate bar ratings.

## Metrics Used

Only chocolate bars with a rating of 4.0 or higher and a cocoa content of at least 80% are considered.
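
A minimal sketch of how these two thresholds could be applied with pandas. The small DataFrame is made up for illustration, and the sketch assumes `CocoaPercent` is stored as a fraction (0.80 for 80%); if the column holds whole percentages, the cutoff would be 80 instead.

```python
import pandas as pd

# Hypothetical sample rows; the real data comes from the ratings dataset.
sample = pd.DataFrame({
    "CompanyLocation": ["France", "U.S.A.", "Italy", "France"],
    "CocoaPercent": [0.85, 0.70, 0.80, 0.90],
    "Rating": [4.0, 4.25, 3.5, 3.75],
})

MIN_RATING = 4.0  # minimum expert rating
MIN_COCOA = 0.80  # minimum cocoa content (assumed to be a fraction)

# Keep only rows that satisfy both conditions.
selected = sample[(sample["Rating"] >= MIN_RATING) &
                  (sample["CocoaPercent"] >= MIN_COCOA)]
print(selected)
```

Only the first row passes both filters here: the second row has a high rating but too little cocoa, and the last two have enough cocoa but a rating below 4.0.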

## Data Cleaning

## Summary

### Import Libraries

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import random

import statsmodels.api as sm
from statsmodels.formula.api import ols

# Import the classes directly; a separate "import datetime" would be shadowed by this line
from datetime import datetime, timedelta

import scipy.stats

import pandas_profiling
from pandas_profiling import ProfileReport


%matplotlib inline
#sets the default autosave frequency in seconds
%autosave 60 
# Set style and font scale in one call; a later sns.set() would reset the style
sns.set(style='dark', font_scale=1.2)

plt.rc('axes', titlesize=9)
plt.rc('axes', labelsize=14)
plt.rc('xtick', labelsize=12)
plt.rc('ytick', labelsize=12)

import warnings
warnings.filterwarnings('ignore')

# Use Folium library to plot values on a map.
#import folium

# Use Feature-Engine library
#import feature_engine
#import feature_engine.missing_data_imputers as mdi
#from feature_engine.outlier_removers import Winsorizer
#from feature_engine import categorical_encoders as ce
#from feature_engine.discretisation import EqualWidthDiscretiser, EqualFrequencyDiscretiser, DecisionTreeDiscretiser
#from feature_engine.encoding import OrdinalEncoder

pd.set_option('display.max_columns',None)
#pd.set_option('display.max_rows',None)
pd.set_option('display.width', 1000)
pd.set_option('display.float_format','{:.2f}'.format)

random.seed(0)
np.random.seed(0)
np.set_printoptions(suppress=True)

## Exploratory Data Analysis

In [None]:
df = pd.read_csv("flavors_of_cacao.csv")

In [None]:
df

In [None]:
df.info()

In [None]:
df.describe()

In [None]:
df.columns

### Groupby Function

In [None]:
df.groupby("Company")[["CocoaPercent","Rating"]].mean()

In [None]:
df.groupby("SpecificBeanOrigin")[["CocoaPercent","Rating"]].mean()

In [None]:
df.groupby("ReviewDate")[["CocoaPercent","Rating"]].mean()

In [None]:
df.groupby("CompanyLocation")[["CocoaPercent","Rating"]].mean()

In [None]:
df.groupby("BeanType")[["CocoaPercent","Rating"]].mean()

In [None]:
df.groupby("BroadBeanOrigin")[["CocoaPercent","Rating"]].mean()

In [None]:
df.Rating.value_counts()
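
Group means are easier to act on when they are ranked and shown next to the number of bars behind each mean. The sketch below illustrates this with named aggregation; the rows are made up, and `mean_rating` and `n_bars` are illustrative column names, not part of the dataset.

```python
import pandas as pd

# Hypothetical ratings for three manufacturer countries.
sample = pd.DataFrame({
    "CompanyLocation": ["France", "France", "Italy", "Italy", "Peru"],
    "Rating": [4.0, 3.5, 3.0, 3.5, 4.0],
})

# Mean rating and number of bars per country, highest mean first.
summary = (
    sample.groupby("CompanyLocation")
          .agg(mean_rating=("Rating", "mean"), n_bars=("Rating", "count"))
          .sort_values("mean_rating", ascending=False)
)
print(summary)
```

Showing the count alongside the mean helps flag countries whose high average rests on a single bar.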

### Pandas-Profiling Reports

In [None]:
#profile = ProfileReport(df=df, title='Cocoa Report', minimal=True)

In [None]:
#profile.to_notebook_iframe()

In [None]:
#profile.to_file("Cocao_Report.html")

## Data Visualization

### Univariate Data Exploration

In [None]:
df.hist(bins=50, figsize=(20,10))
plt.suptitle('Feature Distribution', x=0.5, y=1.02, ha='center', fontsize=20)
plt.tight_layout()
plt.show()

In [None]:
df.boxplot(figsize=(20,10))
plt.suptitle('BoxPlot', x=0.5, y=1.02, ha='center', fontsize=20)
plt.tight_layout()
plt.show()

In [None]:
# Count of bars at each rating value
g = sns.catplot(x='Rating', kind='count', data=df,
            height = 5, aspect = 2)

g.set_xlabels("Rating")
g.set_ylabels("Number of bars")
g.set_xticklabels(rotation=90)

g.fig.suptitle("Rating Counts", y=1.02)

plt.show()

In [None]:
# Distribution of ratings per manufacturer country (horizontal box plots, outliers hidden)
sns.catplot(x="Rating", y="CompanyLocation",
                data=df, linewidth=3, showfliers = False,
                orient="h", height=20, aspect=1,
                kind="box")

plt.show()

In [None]:
# Scatter plot of rating against cocoa percent
sns.relplot(x="CocoaPercent", y="Rating",
            data=df, height = 3, aspect = 2)

### Pairplots

In [None]:
# pairplot creates its own figure, so the title is set on the returned grid
g = sns.pairplot(df)
g.fig.suptitle('Pairplots of features', x=0.5, y=1.02, ha='center', fontsize=20)
plt.show()

### Regression plot

In [None]:
line_color = {'color': 'red'}
fig , ax = plt.subplots(1,1, figsize=(5,5))

# Cocoa percent vs. rating with a fitted regression line
ax = sns.regplot(x=df.CocoaPercent, y=df.Rating, line_kws=line_color, ax=ax)
ax.set_xlabel("Cocoa Percent")
ax.set_ylabel("Rating")

plt.show()

### Correlation

In [None]:
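# numeric_only=True skips the text columns (required in recent pandas versions)
df.corr(numeric_only=True)

In [None]:
df.corr(numeric_only=True)["Rating"].sort_values()

In [None]:
plt.figure(figsize=(9,5))
sns.heatmap(df.corr(numeric_only=True),cmap="coolwarm",annot=True,fmt='.2f',linewidths=2)
plt.title("Correlation Heatmap", fontsize=20)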
plt.show()

## Data Preprocessing

In [None]:
df.head()

### Drop unwanted features

In [None]:
df.columns

In [None]:
df.drop(["REF"],axis=1,inplace=True)

### Treat Missing Values

In [None]:
df.isnull().sum()

In [None]:
df['BeanType'] = df['BeanType'].replace(np.nan,"Missing")

In [None]:
df['BroadBeanOrigin'] = df['BroadBeanOrigin'].replace(np.nan,"Missing")

In [None]:
df.isnull().sum()

In [None]:
df

### Treat Duplicate Values

In [None]:
df.duplicated(keep='first').sum()
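
The cell above only counts duplicate rows. If the count were non-zero, one option would be to drop the repeats, as in the sketch below. The rows are made up for illustration; whether dropping is appropriate depends on whether repeated rows are true duplicates or separate reviews.

```python
import pandas as pd

# Hypothetical data with one exact duplicate row.
sample = pd.DataFrame({
    "Company": ["A", "A", "B"],
    "Rating": [3.5, 3.5, 4.0],
})

# Count rows that repeat an earlier row, then keep only the first occurrence.
n_duplicates = sample.duplicated(keep="first").sum()
deduplicated = sample.drop_duplicates(keep="first").reset_index(drop=True)
print(n_duplicates, len(deduplicated))
```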

### Treat Outliers

In [None]:
df.describe(include='all')
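
The summary statistics above show the range of each column but do not flag outliers. One common rule of thumb, shown here as a sketch and not as the method used in this notebook, marks values more than 1.5 times the interquartile range (IQR) beyond the quartiles. The ratings are made up.

```python
import pandas as pd

# Hypothetical ratings with one unusually low value.
ratings = pd.Series([3.0, 3.25, 3.5, 3.5, 3.75, 1.0], name="Rating")

q1 = ratings.quantile(0.25)
q3 = ratings.quantile(0.75)
iqr = q3 - q1

# Fences at 1.5 * IQR below Q1 and above Q3.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

outliers = ratings[(ratings < lower) | (ratings > upper)]
print(outliers)
```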

### Type Change

In [None]:
df.info()

In [None]:
df["RatingCat"] = df["Rating"].copy()

In [None]:
df

In [None]:
df["RatingCat"] = df["RatingCat"].astype("category")

In [None]:
df.info()

In [None]:
df.describe(include='all')

## Hypothesis Testing

The goal of hypothesis testing is to answer the question, “Given a sample and an apparent effect, what is the probability of seeing such an effect by chance?” The first step is to quantify the size of the apparent effect by choosing a test statistic (for example, the t statistic or the ANOVA F statistic). The next step is to define a null hypothesis, which is a model of the system based on the assumption that the apparent effect is not real. Then compute the p-value, which is the probability of seeing an effect at least as large as the observed one if the null hypothesis is true. Finally, interpret the result: if the p-value is low, the effect is said to be statistically significant, which suggests that the null hypothesis may not be accurate.
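
The steps above can be illustrated with a permutation test, which simulates the null hypothesis directly by shuffling group labels. This is a sketch on made-up ratings, not part of the original analysis; the group values and the number of iterations are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two made-up groups of ratings.
group_a = np.array([3.0, 3.25, 3.5, 3.75, 3.5, 3.25])
group_b = np.array([3.5, 3.75, 4.0, 3.75, 4.0, 3.5])

# Step 1: test statistic = absolute difference in group means.
observed = abs(group_a.mean() - group_b.mean())

# Step 2: under the null hypothesis the labels are interchangeable,
# so shuffle the pooled values and re-split them many times.
pooled = np.concatenate([group_a, group_b])
n_a = len(group_a)
n_iter = 10_000
count = 0
for _ in range(n_iter):
    shuffled = rng.permutation(pooled)
    diff = abs(shuffled[:n_a].mean() - shuffled[n_a:].mean())
    if diff >= observed - 1e-12:  # small tolerance for floating-point ties
        count += 1

# Step 3: p-value = share of shuffles with an effect at least as large as observed.
p_value = count / n_iter
print("Observed difference:", observed)
print("Permutation p-value:", p_value)
```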

### T-Test

We will be using the t-test for independent samples. For the independent t-test, the following assumptions must be met.

-   One independent, categorical variable with two levels or groups
-   One dependent continuous variable
-   Independence of the observations. Each subject should belong to only one group. There is no relationship between the observations in each group.
-   The dependent variable must follow a normal distribution
-   Assumption of homogeneity of variance
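
The normality and equal-variance assumptions in the list above can be checked before running the test. The sketch below uses the Shapiro-Wilk test and Levene's test from `scipy.stats` on simulated groups; the group means, spreads, and sizes are assumptions chosen only for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Simulated ratings for two hypothetical groups.
group_a = rng.normal(loc=3.2, scale=0.4, size=50)
group_b = rng.normal(loc=3.4, scale=0.4, size=50)

# Shapiro-Wilk: the null hypothesis is that the sample is normally distributed.
w_a, p_norm_a = stats.shapiro(group_a)
w_b, p_norm_b = stats.shapiro(group_b)

# Levene: the null hypothesis is that the groups have equal variances.
lev_stat, p_var = stats.levene(group_a, group_b)

print("Shapiro p-values:", p_norm_a, p_norm_b)
print("Levene p-value:", p_var)
```

A small p-value in either test is evidence against the corresponding assumption. When equal variances are doubtful, passing `equal_var=False` to `scipy.stats.ttest_ind` runs Welch's t-test instead.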


State the hypothesis

-   $H_0: \mu_1 = \mu_2$ ("there is no difference between the mean cocoa percent and the mean rating")
-   $H_1: \mu_1 \neq \mu_2$ ("there is a difference between the mean cocoa percent and the mean rating")


## T-Test

### One Sample T-Test

In [None]:
t, p = scipy.stats.ttest_1samp(a=df.CocoaPercent, popmean=0.72)

In [None]:
print("t-statistic is: ", t)
print("p-value is: ", p)

### Two Samples T-Test

In [None]:
# Use the numeric Rating column; the categorical RatingCat copy cannot be averaged
t, p = scipy.stats.ttest_ind(a=df.CocoaPercent,b=df.Rating, equal_var = False)

In [None]:
print("t-statistic is: ",t)
print("p-value is: ",p)

The result is statistically significant because the p-value is below 0.05, so the null hypothesis is rejected.

## Chi-square

State the hypothesis:

-   $H_0:$ Rating is independent of Bean Type
-   $H_1:$ Rating is associated with Bean Type

In [None]:
#Create a Cross-tab table

cont_table  = pd.crosstab(df['BeanType'], df['RatingCat'])
cont_table

In [None]:
chi_square = scipy.stats.chi2_contingency(cont_table, correction = True)

In [None]:
print("Chi-square statistic is", chi_square[0])

In [None]:
print("P-value is", chi_square[1])

In [None]:
print("Degrees of freedom is", chi_square[2])

The p-value is above 0.05, so we fail to reject the null hypothesis: there is no evidence that Rating and Bean Type are associated.
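
To make the chi-square statistic less of a black box, the sketch below recomputes it by hand for a small made-up 2x2 table and compares the result with `scipy.stats.chi2_contingency`. It passes `correction=False` so that the manual formula matches exactly; Yates' continuity correction, which `correction=True` enables, only applies to tables with one degree of freedom.

```python
import numpy as np
from scipy import stats

# Made-up contingency table (rows: two bean types, columns: two rating levels).
observed = np.array([[10, 20],
                     [30, 40]])

# Expected count per cell under independence: row total * column total / grand total.
row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
expected_manual = row_totals * col_totals / observed.sum()

# Chi-square statistic: sum of (observed - expected)^2 / expected over all cells.
chi2_manual = ((observed - expected_manual) ** 2 / expected_manual).sum()

chi2, p, dof, expected = stats.chi2_contingency(observed, correction=False)

print("Manual chi-square:", chi2_manual)
print("SciPy chi-square:", chi2)
print("Degrees of freedom:", dof)
```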

### Correlation

State the hypothesis:

-   $H_0:$ Cocoa Percent is not correlated with Rating
-   $H_1:$ Cocoa Percent is correlated with Rating


In [None]:
pearson_correlation = scipy.stats.pearsonr(df['CocoaPercent'], df['Rating'])

In [None]:
print("Pearson's correlation coefficient is", pearson_correlation[0])

In [None]:
print("P-value is", pearson_correlation[1])

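The p-value is below 0.05, so the null hypothesis is rejected: Cocoa Percent is significantly correlated with Rating. Correlation alone does not show that cocoa content causes the rating to change.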
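
For reference, Pearson's coefficient is the covariance of the two variables divided by the product of their standard deviations. The sketch below computes it by hand on made-up values and checks the result against `scipy.stats.pearsonr`; the numbers are illustrative and not taken from the dataset.

```python
import numpy as np
from scipy import stats

# Made-up cocoa fractions and ratings.
x = np.array([0.60, 0.70, 0.72, 0.80, 0.85, 1.00])
y = np.array([3.5, 3.75, 3.5, 3.0, 3.25, 2.5])

# Center both variables, then divide the sum of cross-products
# by the square root of the product of the sums of squares.
x_c = x - x.mean()
y_c = y - y.mean()
r_manual = (x_c * y_c).sum() / np.sqrt((x_c ** 2).sum() * (y_c ** 2).sum())

r_scipy, p_value = stats.pearsonr(x, y)

print("Manual r:", r_manual)
print("SciPy r:", r_scipy)
print("P-value:", p_value)
```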

## Chocolate Ratings 4.0 and above

In [None]:
df.head()

In [None]:
# .copy() makes df2 an independent DataFrame, so later in-place changes do not touch df
df2 = df[df["Rating"] >= 4.0].copy()

In [None]:
df2

In [None]:
df2.describe()

In [None]:
df2.reset_index(drop=True, inplace=True)

In [None]:
df2

In [None]:
df2.info()

In [None]:
#Save to csv
#df2.to_csv("chocohighrating.csv", index=False)

In [None]:
fig = plt.figure(figsize=(20,40))

plt.subplot(7,2,1)
plt.title("Rating Counts")
sns.countplot(x="Rating", data=df2)

plt.subplot(7,2,2)
plt.title("Review Year Counts")
sns.countplot(x="ReviewDate", data=df2)

plt.subplot(7,2,3)
plt.title("Mean Rating by Review Year")
sns.barplot(x="ReviewDate", y="Rating", data=df2)

plt.subplot(7,2,4)
plt.title("Mean Rating by Broad Bean Origin")
plt.xticks(rotation=90)
sns.barplot(x="BroadBeanOrigin", y="Rating", data=df2)

plt.subplot(7,2,5)
plt.title("Mean Rating by Bean Type")
plt.xticks(rotation=90)
sns.barplot(x="BeanType", y="Rating", data=df2)

plt.subplot(7,2,6)
plt.title("Mean Rating by Company Location")
plt.xticks(rotation=90)
sns.barplot(x="CompanyLocation", y="Rating", data=df2)

plt.subplot(7,2,7)
plt.title("Mean Rating by Specific Bean Origin")
plt.xticks(rotation=90)
sns.barplot(x="SpecificBeanOrigin", y="Rating", data=df2)

plt.subplot(7,2,8)
plt.title("Mean Rating by Company")
plt.xticks(rotation=90)
sns.barplot(x="Company", y="Rating", data=df2)

plt.subplot(7,2,9)
plt.title("Cocoa Percent vs. Rating")
sns.scatterplot(x="CocoaPercent", y="Rating", data=df2)

plt.tight_layout()
plt.show()

### Save to CSV

In [None]:
df.to_csv("filename.csv", index=False)

## Regression Analysis

In [None]:
df.columns

In [None]:
# Dependent variable: Rating; independent variable: CocoaPercent (as in the regression plot)
y = df['Rating']
X = df['CocoaPercent']

In [None]:
# Add an intercept column to the predictors
X = sm.add_constant(X)

In [None]:
model = sm.OLS(y,X).fit()

In [None]:
model.summary()

In [None]:
prediction = model.predict(X)

### Residual Plots

In [None]:
fig = plt.figure(figsize=(12,8))
fig = sm.graphics.plot_regress_exog(model, 'CocoaPercent', fig=fig)
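
As a rough illustration of what an ordinary least squares fit with a constant term computes, the sketch below solves the same kind of problem with NumPy alone. The data points are made up so that they lie exactly on a line, and the code is a simplified stand-in, not the statsmodels implementation.

```python
import numpy as np

# Made-up predictor and response lying exactly on y = 4.8 - 2x.
x = np.array([0.6, 0.7, 0.8, 0.9])
y = np.array([3.6, 3.4, 3.2, 3.0])

# Design matrix with a column of ones for the intercept.
X = np.column_stack([np.ones_like(x), x])

# Least-squares solution: coefficients minimizing the sum of squared residuals.
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
intercept, slope = coef

# Residuals are observed values minus fitted values.
residuals = y - X @ coef

print("Intercept:", intercept)
print("Slope:", slope)
```

With real ratings the residuals would not be zero, and plotting them against the predictor is what the residual plots above are for.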

#### Python code done by Dennis Lam