<a href="https://colab.research.google.com/github/YSSF934/thinkful_students_projects/blob/main/Final_Capstone.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

##**Thinkful Final Capstone**

###**Cannabis Data Analysis**

The purpose of this notebook is to analyze a CSV file of cannabis strain records sourced from Kaggle.

Only the "rating", "strain", and "type" fields from the dataset will be used:

- Rating is defined as a score between 1 and 5 with decimal precision.

- Strain is defined as the name of the cannabis.

- Type is defined as the type of strain, which can be "hybrid", "sativa", or "indica" (lowercase, matching the values the code filters on).

This project will focus on testing hypotheses surrounding the rating of strains based on their types. The hypotheses are as follows:

###**Hypothesis #1**

**Null:** There is no correlation between the types of strains and their ratings. This means the ratings of types are not distributed similarly.

**Alternative:** There is a correlation between the types of strains and their ratings. This means the ratings of types are distributed similarly.

###**Hypothesis #2**

**Null:** There is no statistically significant difference between ratings depending on the type of the strain.

**Alternative:** There is a statistically significant difference between ratings depending on the type of the strain.



In [1]:
# Execute all imports, set warning filter, and mount gdrive to connect data
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import math
from scipy.stats import ttest_ind
from scipy import stats
sns.set()
from google.colab import drive
drive.mount('/content/gdrive')
import warnings

warnings.filterwarnings('ignore')

Mounted at /content/gdrive


After a successful connection to the data, we will store the dataframe in a variable named "df" and explore its columns.

In [2]:
# Contain dataframe within "df" variable using pandas to contain path within 
# .read_csv function
# View columns
df = pd.read_csv('/content/gdrive/MyDrive/Colab Datasets/cannabis.csv')
df.columns

FileNotFoundError: ignored

Let's explore the data further before diving into analysis.

In [None]:
# Peer into the data using .info() function
df.info()

We have now discovered null values within the data. For this analysis we will only be using complete records so we must remove these nulls.

In [None]:
# Use .dropna() function to drop all null values from the dataframe and contain
# this within a new variable called "df_filtered"
df_filtered = df.dropna()

Let's confirm that our data is clean by verifying that all columns have the same number of records.

In [None]:
# Use the .info() function on the newly created "df_filtered" variable
df_filtered.info()

Now we only have complete rows of records within our data. As an additional cleaning step, we will remove the records with a rating of 0.

In [None]:
# Filter out records that have 0 value and contain within "above_0" variable
# Locate values using .iloc[variable.values] and contain within "df_clean" 
# variable
above_0 = df_filtered['rating'] > 0
df_clean = df_filtered.iloc[above_0.values]
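The two cleaning steps above can also be sketched on a small toy frame. Note that `df.dropna()` with no arguments drops a row when any column is null, including columns this analysis never uses. The sketch below is a variation, not the notebook's method: it assumes the three analysed columns are named "strain", "type", and "rating", and the rows (including the extra "flavor" column) are invented for illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Kaggle data; every row here is invented.
toy = pd.DataFrame({
    "strain": ["A", "B", "C", "D", "E"],
    "type": ["hybrid", "indica", None, "sativa", "hybrid"],
    "rating": [4.5, 0.0, 4.2, np.nan, 3.9],
    "flavor": [None, "Earthy", "Citrus", "Sweet", "Pine"],  # unused column
})

# Drop a row only when one of the analysed columns is missing.
used = ["strain", "type", "rating"]
toy_clean = toy.dropna(subset=used)

# Keep ratings above 0 with a boolean mask; .loc accepts the mask directly.
toy_clean = toy_clean.loc[toy_clean["rating"] > 0, used]
print(toy_clean)
```

Row "A" survives even though its unused "flavor" value is null, which a plain `dropna()` would have discarded.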

We have now cleaned our data to the point where we have 2211 rows of complete records. Let's separate this data into three additional dataframes, one for each type: hybrid, indica, and sativa.

In [None]:
# Create filter variables for each type using "df_clean" and similar method from
# previous coding cell
# Contain newly filtered dataframes within variables named after each type
hybrid_filter = df_clean['type'] == 'hybrid'
indica_filter = df_clean['type'] == 'indica'
sativa_filter = df_clean['type'] == 'sativa'

hybrid = df_clean.iloc[hybrid_filter.values]
indica = df_clean.iloc[indica_filter.values]
sativa = df_clean.iloc[sativa_filter.values]
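An equality filter returns an empty dataframe without any error when the label's casing does not match the data, so it can help to inspect the distinct labels first. The following is a minimal sketch on invented rows, assuming a "type" column with lowercase labels. It also shows `groupby` as an alternative way to obtain one sub-frame per type.

```python
import pandas as pd

# Invented rows for illustration only.
toy = pd.DataFrame({
    "strain": ["A", "B", "C", "D"],
    "type": ["hybrid", "indica", "sativa", "hybrid"],
    "rating": [4.5, 4.1, 4.3, 3.9],
})

# Inspect the labels that actually occur before filtering on them.
print(toy["type"].unique())

# Build one sub-frame per type, keyed by its label.
by_type = {label: group for label, group in toy.groupby("type")}
print({label: len(group) for label, group in by_type.items()})
```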

Finally, we have everything in place to begin testing our hypotheses starting with:

##**Hypothesis #1**

**Hypothesis:** There is a correlation between the type of strains and their ratings, meaning the ratings of hybrid, sativa, and indica type strains are distributed similarly.

Since we have two string fields and one numeric field we will assess this correlation visually.

Let's start with a scatterplot containing the distribution of all types separated by color.


In [None]:
# Set a figure size and create scatterplot with custom attributes
plt.figure(figsize = (10,5))
# x_bins is not a scatterplot parameter in current seaborn releases, so it is
# omitted here
a1 = sns.scatterplot(x="rating", y="strain", hue="type", data=df_clean)
a1.set(yticklabels=[])
a1.set(title='Type Ratings (All Types, Scatterplot)')
a1.set(xlabel='Rating')
a1.set(ylabel='Strain')
plt.xlim(0, 5.5)
plt.show()

This gives us insight as to where the majority of ratings lie within all types. We can quickly see that ratings are typically 3 and above, with a strong concentration between 4 and 5. How similar is this distribution when the types are not overlapped? Let's separate the types into their own graphs.

In [None]:
# Set a figure size and create scatterplots with custom attributes
plt.figure(figsize=(10,5))
hy1 = sns.scatterplot(x="rating", y="strain", hue="type", legend=False, 
                      data=hybrid, palette=['blue'])
hy1.set(yticklabels=[])
hy1.set(title='Hybrid Ratings (Scatterplot)')
hy1.set(xlabel='Rating')
hy1.set(ylabel='Strain')
plt.xlim(0, 5.1)
plt.show()

plt.figure(figsize=(10,5))
sa1 = sns.scatterplot(x="rating", y="strain", hue="type", legend=False,
                      data=sativa, palette=['darkorange'])
sa1.set(yticklabels=[])
sa1.set(title='Sativa Ratings (Scatterplot)')
sa1.set(xlabel='Rating')
sa1.set(ylabel='Strain')
plt.xlim(0, 5.1)
plt.show()

plt.figure(figsize=(10,5))
ind1 = sns.scatterplot(x="rating", y="strain", hue="type", legend=False,
                       data=indica, palette=['green'])
ind1.set(yticklabels=[])
ind1.set(title='Indica Ratings (Scatterplot)')
ind1.set(xlabel='Rating')
ind1.set(ylabel='Strain')
plt.xlim(0, 5.1)
plt.show()

These visuals tell us that the distribution of ratings is similar across types. At a glance we can see that few strains of any type are rated below 4 and that, regardless of type, the majority of ratings fall between 4 and 5.

This supports the hypothesis that a correlation exists. However, before we finalize a summary we can add another layer of analysis by viewing our data in histograms.

In [None]:
# Set a figure size and create histogram with custom attributes
plt.figure(figsize=(10, 5))
a2 = sns.histplot(data=df_clean, x="rating", hue="type", kde=True)
a2.set(title='Type Ratings (All Types, Histogram)')
a2.set(xlabel='Rating')
a2.set(ylabel='Strain Count')
plt.xlim(0, 5.1)
plt.ylim(0, 160)
plt.show()

By analyzing this histogram we can see that the counts of strains differ, yet the kernel density estimate of each type is very similar. As we did before, to gain further insight we can separate these types into their own graphs.

In [None]:
# Set a figure size and create histograms with custom attributes
plt.figure(figsize=(10, 5))
hy2 = sns.histplot(data=hybrid, x="rating", kde=True, legend=False, 
                   color='blue')
hy2.set(title='Hybrid Ratings (Histogram)')
hy2.set(xlabel='Rating')
hy2.set(ylabel='Strain Count')
plt.xlim(0, 5.1)
plt.ylim(0, 160)
plt.show()

plt.figure(figsize=(10, 5))
sa2 = sns.histplot(data=sativa, x="rating", kde=True, legend=False, 
                   color='darkorange' )
sa2.set(title='Sativa Ratings (Histogram)')
sa2.set(xlabel='Rating')
sa2.set(ylabel='Strain Count')
plt.xlim(0, 5.1)
plt.ylim(0, 160)
plt.show()

plt.figure(figsize=(10, 5))
ind2 = sns.histplot(data=indica, x="rating", kde=True, legend=False, 
                    color="green")
ind2.set(title='Indica Ratings (Histogram)')
ind2.set(xlabel='Rating')
ind2.set(ylabel='Strain Count')
plt.xlim(0, 5.1)
plt.ylim(0, 160)
plt.show()

Now we can see that, although the counts of strains differ, the distributions of ratings are similar.
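A visual comparison like this can be complemented with a numeric one. As a hedged sketch only, and not part of the original analysis, a two-sample Kolmogorov-Smirnov test (`scipy.stats.ks_2samp`) compares two empirical distributions directly. The samples below are synthetic stand-ins generated with NumPy; with the real data one would pass two of the per-type "rating" columns instead. Ratings rounded to one decimal contain many ties, so the resulting p-value should be read as approximate.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic ratings clipped to the 1-5 scale; invented for illustration.
group_a = np.clip(rng.normal(4.3, 0.4, size=300), 1, 5)
group_b = np.clip(rng.normal(4.3, 0.4, size=200), 1, 5)

# The statistic is the largest gap between the two empirical CDFs.
result = stats.ks_2samp(group_a, group_b)
print(result.statistic, result.pvalue)
```

A small statistic together with a large p-value means the test found no evidence that the two samples come from different distributions.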

##**Summary of findings for hypothesis #1:**

The scatterplots show a strong correlation for ratings across all three types. The histograms show strong correlations as well, through similar kernel density estimates and similar distributions. This means that ratings are similar amongst all types. **Hypothesis supported.**



##**Hypothesis #2**

**Hypothesis:** There is a statistically significant difference between ratings depending on the type of the strain.

To start we can view descriptive statistics about the ratings of types. Let's build a function to achieve this.

In [None]:
# Define function to display descriptive statistics
def statistics(column):
  print('The max value in the column: {}'.format(column.max()))
  print('The min value in the column: {}'.format(column.min()))
  print('The mode value(s) in the column: {}'.format(column.mode().tolist()))
  print('The median value in the column: {}'.format(column.median()))
  print('The mean value in the column: {}'.format(column.mean()))
  print('The std of the column: {}'.format(column.std()))

Now we can run descriptive statistics on each one of our types.

In [None]:
# Use function with hybrid ratings
statistics(hybrid['rating'])

In [None]:
# Use function with sativa ratings
statistics(sativa['rating'])

In [None]:
# Use function with indica ratings
statistics(indica['rating'])

In [None]:
# Group by type and show the average ratings
df_clean.groupby('type')['rating'].mean()

This gives us some preliminary insight. We specifically wanted the means, as they will be the central focus of our t-tests.
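The per-type statistics above can also be gathered into a single table. The following is a minimal sketch on invented rows, assuming columns named "type" and "rating"; it uses `groupby(...).agg(...)` so that all three types appear side by side.

```python
import pandas as pd

# Invented ratings for illustration only.
toy = pd.DataFrame({
    "type": ["hybrid", "hybrid", "indica", "indica", "sativa", "sativa"],
    "rating": [4.0, 5.0, 4.2, 4.4, 3.8, 4.6],
})

# Select the rating column first so only numeric data is aggregated.
summary = toy.groupby("type")["rating"].agg(
    ["count", "min", "max", "median", "mean", "std"]
)
print(summary)
```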

Now we can begin our two-tailed t-tests on the ratings of hybrid & sativa, hybrid & indica, and sativa & indica types.

We can start with hybrid & sativa:

In [None]:
# Perform t-test on hybrid and sativa ratings
stats.ttest_ind(hybrid['rating'], sativa['rating'])
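One detail worth noting: `stats.ttest_ind` defaults to `equal_var=True`, which is the pooled-variance Student's t-test. The standard error used for the confidence interval in this notebook is the unpooled form, which matches Welch's t-test (`equal_var=False`). The sketch below uses synthetic samples, invented for illustration, to show both calls and to confirm that the unpooled standard error reproduces the Welch statistic.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Synthetic samples with unequal sizes and spreads; not the real ratings.
a = rng.normal(4.3, 0.4, size=400)
b = rng.normal(4.3, 0.6, size=150)

pooled = stats.ttest_ind(a, b)                   # default: equal_var=True
welch = stats.ttest_ind(a, b, equal_var=False)   # Welch's t-test

# Unpooled standard error of the difference in means.
se = np.sqrt(a.var(ddof=1) / a.size + b.var(ddof=1) / b.size)
manual_t = (a.mean() - b.mean()) / se

print(pooled.statistic, welch.statistic, manual_t)
```

When group sizes and variances are close, the two variants give nearly the same result, so the choice matters most for unbalanced groups.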

Next we can store components of our confidence interval within variables.

In [None]:
# Set components for test 1
t1s1_n = hybrid.shape[0]
t1s2_n = sativa.shape[0]
t1s1_mean = hybrid['rating'].mean()
t1s2_mean = sativa['rating'].mean()
t1s1_var = hybrid['rating'].var()
t1s2_var = sativa['rating'].var()

Finally we are ready to establish a confidence interval and produce a more informative result.

In [None]:
# Create standard error of difference, mean difference, margin of error, and
# confidence interval(lower,upper) variables
# Print full result of t-test
std_err_difference = math.sqrt((t1s1_var/t1s1_n)+(t1s2_var/t1s2_n))

mean_difference = t1s2_mean - t1s1_mean

margin_of_error = 1.96 * std_err_difference
ci_lower = mean_difference - margin_of_error
ci_upper = mean_difference + margin_of_error

print("The difference in means at the 95% confidence interval is between "
+str(ci_lower)+" and "+str(ci_upper)+".")
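Written out, the computation in the cell above is the following, where the subscripts 1 and 2 denote the first and second sample, $s^2$ is the sample variance, and $n$ is the sample size. The factor 1.96 is the standard normal critical value for a 95% interval, an approximation that is reasonable for samples of this size.

```latex
\mathrm{SE} = \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}},
\qquad
\mathrm{CI}_{95\%} = (\bar{x}_2 - \bar{x}_1) \pm 1.96 \cdot \mathrm{SE}
```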

With a p-value of 0.4 > .05 we fail to reject the null. There is no statistically significant difference between the means of hybrid and sativa ratings. The 95% confidence interval is between -.05 and .02, a negligible range that includes zero. This result is consistent with the null hypothesis.

To continue we will take similar steps to test hybrid & indica:



In [None]:
# Perform t-test on hybrid and indica ratings
stats.ttest_ind(hybrid['rating'], indica['rating'])

In [None]:
# Set components for test 2
t2s1_n = hybrid.shape[0]
t2s2_n = indica.shape[0]
t2s1_mean = hybrid['rating'].mean()
t2s2_mean = indica['rating'].mean()
t2s1_var = hybrid['rating'].var()
t2s2_var = indica['rating'].var()

In [None]:
# Create standard error of difference, mean difference, margin of error, and
# confidence interval(lower,upper) variables
# Print full result of t-test
std_err_difference = math.sqrt((t2s1_var/t2s1_n)+(t2s2_var/t2s2_n))

mean_difference = t2s2_mean - t2s1_mean

margin_of_error = 1.96 * std_err_difference
ci_lower = mean_difference - margin_of_error
ci_upper = mean_difference + margin_of_error

print("The difference in means at the 95% confidence interval is between "
+str(ci_lower)+" and "+str(ci_upper)+".")

With a p-value of .86 > .05 we fail to reject the null. There is no statistically significant difference between the means of hybrid and indica ratings. The 95% confidence interval is between -.03 and .03, a negligible range that includes zero. This result is consistent with the null hypothesis.

Lastly we will repeat to test sativa & indica:

In [None]:
# Perform t-test on sativa and indica ratings
stats.ttest_ind(sativa['rating'], indica['rating'])

In [None]:
# Set components for test 3
t3s1_n = sativa.shape[0]
t3s2_n = indica.shape[0]
t3s1_mean = sativa['rating'].mean()
t3s2_mean = indica['rating'].mean()
t3s1_var = sativa['rating'].var()
t3s2_var = indica['rating'].var()

In [None]:
# Create standard error of difference, mean difference, margin of error, and
# confidence interval(lower,upper) variables
# Print full result of t-test
std_err_difference = math.sqrt((t3s1_var/t3s1_n)+(t3s2_var/t3s2_n))

mean_difference = t3s2_mean - t3s1_mean

margin_of_error = 1.96 * std_err_difference
ci_lower = mean_difference - margin_of_error
ci_upper = mean_difference + margin_of_error

print("The difference in means at the 95% confidence interval is between "
+str(ci_lower)+" and "+str(ci_upper)+".")

With a p-value of .41 > .05 we fail to reject the null. There is no statistically significant difference between the means of sativa and indica ratings. The 95% confidence interval is between -.02 and .05, a negligible range that includes zero. This result is consistent with the null hypothesis.

Before finalizing a summary it is good practice to verify the accuracy of the analysis. To do this we will create a function to define a confidence interval and check that the results match the previous tests.

In [None]:
# Define function to display confidence interval
def get_95_ci(array1, array2):
  sample1_n = array1.shape[0]
  sample2_n = array2.shape[0]
  sample1_mean = array1.mean()
  sample2_mean = array2.mean()
  sample1_var = array1.var()
  sample2_var = array2.var()
  mean_difference = sample2_mean - sample1_mean
  std_err_difference = math.sqrt((sample1_var/sample1_n)+(sample2_var/sample2_n))
  margin_of_error = 1.96 * std_err_difference
  ci_lower = mean_difference - margin_of_error
  ci_upper = mean_difference + margin_of_error
  return("The difference in means at the 95% confidence interval is between "
  +str(ci_lower)+" and "+str(ci_upper)+".")

We can now call the function with each pair of arrays tested previously.

In [None]:
get_95_ci(hybrid['rating'], sativa['rating'])

In [None]:
get_95_ci(hybrid['rating'], indica['rating'])

In [None]:
get_95_ci(sativa['rating'], indica['rating'])
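The three calls above can also be driven by a loop. The following is a minimal sketch, assuming a hypothetical helper `ci_95` that returns the bounds as numbers instead of a formatted string, and using synthetic samples in place of the real ratings. `itertools.combinations` yields the pairs in the same order used in this notebook.

```python
import math
from itertools import combinations

import numpy as np
import pandas as pd


def ci_95(sample1, sample2):
    """Return (lower, upper) for mean(sample2) - mean(sample1).

    Hypothetical helper; uses the 1.96 normal approximation.
    """
    se = math.sqrt(sample1.var() / len(sample1) + sample2.var() / len(sample2))
    diff = sample2.mean() - sample1.mean()
    return diff - 1.96 * se, diff + 1.96 * se


rng = np.random.default_rng(2)

# Synthetic stand-ins for the per-type rating columns.
groups = {
    "hybrid": pd.Series(rng.normal(4.3, 0.4, size=500)),
    "sativa": pd.Series(rng.normal(4.3, 0.4, size=200)),
    "indica": pd.Series(rng.normal(4.3, 0.4, size=300)),
}

intervals = {}
for (name1, s1), (name2, s2) in combinations(groups.items(), 2):
    lower, upper = ci_95(s1, s2)
    intervals[(name1, name2)] = (lower, upper)
    print(f"{name2} - {name1}: [{lower:.3f}, {upper:.3f}]")
```

Returning numbers keeps the bounds available for further checks, such as testing whether an interval contains zero.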

With our test results verified we conclude our analysis.

###**Summary of findings for hypothesis #2:**

The t-tests all conclude that there is no significant difference between ratings of hybrid, sativa, and indica types. Based on the large p-values, each test fails to reject the null. **Hypothesis rejected.**
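Running three pairwise tests at the .05 level raises the chance of at least one false positive across the set. As a hedged aside that is not part of the original analysis, a one-way ANOVA (`scipy.stats.f_oneway`) compares all three group means in a single test. The sketch below uses synthetic samples, invented for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

# Synthetic ratings per type; not the real data.
hybrid_r = rng.normal(4.3, 0.4, size=500)
sativa_r = rng.normal(4.3, 0.4, size=200)
indica_r = rng.normal(4.3, 0.4, size=300)

# Null hypothesis: all three groups share the same mean.
f_stat, p_value = stats.f_oneway(hybrid_r, sativa_r, indica_r)
print(f_stat, p_value)
```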

##**Recommendations:**

Analysis of the first hypothesis revealed a correlation between the types of strains: ratings are distributed similarly across all types. We also gained insight into the volume of each type at each rating level, which shows how appreciation is spread amongst the types. Because sativa and indica strains rate similarly to hybrids while their volumes are lower, it would be advisable to increase the volume of sativa and indica types.

Analysis of the second hypothesis revealed that a significant difference between the ratings of the types does not exist. This gives us insight into how evenly the types are rated. Considering how close these ratings are, all types appear to have a viable market. Based on this analysis, maintaining a healthy balance of each type is recommended.

**Source: kaggle.com/datasets/nvisagan/cannabis-strains-features**

**Notebook created by: Ahmad Zahid Sharif**