# Capstone Residual

Let's change our park_location and park_area features into categories.

In [79]:
rides['park_location'].dtypes

dtype('O')

In [80]:
rides['park_location'] = rides['park_location'].astype('category')

In [81]:
rides['park_location'].dtypes

CategoricalDtype(categories=['Animal Kingdom', 'Epcot', 'Hollywood Studios',
                  'Magic Kingdom'],
, ordered=False)

In [82]:
rides['park_area'].dtypes

dtype('O')

In [83]:
rides['park_area'] = rides['park_area'].astype('category')

In [84]:
rides['park_area'].dtypes

CategoricalDtype(categories=['Adventureland', 'Africa', 'Asia', 'Dinoland USA',
                  'Echo Lake', 'Fantasyland', 'Frontierland', 'Future World',
                  'Liberty Square', 'Main Street USA', 'Pandora',
                  'Sunset Boulevard', 'Tomorrowland', 'Toy Story Land',
                  'World Showcase'],
, ordered=False)
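
The two conversions above can also be done in a single call. The sketch below is a minimal illustration on a hypothetical three-row stand-in for `rides` (the `rides_demo` frame and its rows are assumptions made for the example, not the real data); it passes a column-to-dtype mapping to `DataFrame.astype`.

```python
import pandas as pd

# Hypothetical miniature of the rides frame, for illustration only.
rides_demo = pd.DataFrame({
    "ride": ["Space Mountain", "Soarin'", "Expedition Everest"],
    "park_location": ["Magic Kingdom", "Epcot", "Animal Kingdom"],
    "park_area": ["Tomorrowland", "Future World", "Asia"],
})

# Convert both object columns to category in one step.
rides_demo = rides_demo.astype({"park_location": "category", "park_area": "category"})

print(rides_demo.dtypes)
# Categories are inferred from the unique values and sorted.
print(list(rides_demo["park_location"].cat.categories))
```

Columns not named in the mapping (here `ride`) keep their original dtype.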

And now let's check our datatypes again.

In [85]:
rides.dtypes

ride                                 object
park_location                      category
park_area                          category
ride_type_thrill                       bool
ride_type_spinning                     bool
ride_type_slow                         bool
ride_type_small_drops                  bool
ride_type_big_drops                    bool
ride_type_dark                         bool
ride_type_scary                        bool
ride_type_water                        bool
fast_pass                              bool
classic                                bool
age_interest_all                       bool
age_interest_preschoolers              bool
age_interest_kids                      bool
age_interest_tweens                    bool
age_interest_teens                     bool
age_interest_adults                    bool
height_req_inches                     int64
ride_duration_min                   float64
open_date                    datetime64[ns]
age_of_ride_years               

#### Convert the 'rating' column from a float to a category so it can be treated as a class label for classification models.

Until now, our 'rating' target has been stored as a float, so we haven't treated it as a classification problem.

Framing the task as classification might give us a better model.

To do that, we first need to change the data type of 'rating' from float to category.

In [215]:
ride_reviews['rating'].dtypes

dtype('float64')

In [216]:
ride_reviews['rating'] = ride_reviews['rating'].astype('category')

In [217]:
ride_reviews['rating'].dtypes

CategoricalDtype(categories=[1.0, 2.0, 3.0, 4.0, 5.0], ordered=False)
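
The conversion above produces an unordered categorical (`ordered=False`). Star ratings have a natural order, so one option is to declare it explicitly. This is a sketch of an alternative, not what the notebook does; `ratings` below is a hypothetical stand-in for `ride_reviews['rating']`.

```python
import pandas as pd

# Hypothetical stand-in for ride_reviews['rating'].
ratings = pd.Series([5.0, 4.0, 5.0, 1.0, 3.0], name="rating")

# Declare the five star levels and their order explicitly.
rating_dtype = pd.CategoricalDtype(categories=[1.0, 2.0, 3.0, 4.0, 5.0], ordered=True)
ratings_cat = ratings.astype(rating_dtype)

print(ratings_cat.dtype)
# Order-aware reductions such as max() work on an ordered categorical.
print(ratings_cat.max())

# Numeric summaries such as the mean need the values cast back to float.
mean_rating = ratings_cat.astype(float).mean()
print(mean_rating)
```

Casting back with `astype(float)` is useful whenever a later step needs numeric statistics on the ratings.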

In [218]:
ride_reviews.dtypes

ride                                 object
reviewer                             object
review_title                         object
review_text                          object
rating                             category
park_location                      category
park_area                          category
ride_type_thrill                       bool
ride_type_spinning                     bool
ride_type_slow                         bool
ride_type_small_drops                  bool
ride_type_big_drops                    bool
ride_type_dark                         bool
ride_type_scary                        bool
ride_type_water                        bool
fast_pass                              bool
classic                                bool
age_interest_all                       bool
age_interest_preschoolers              bool
age_interest_kids                      bool
age_interest_tweens                    bool
age_interest_teens                     bool
age_interest_adults             

Source: https://pandas.pydata.org/docs/user_guide/categorical.html

#### Joined Dataframe EDA

In [340]:
ride_reviews.groupby('ride')[['rating' , 'ta_stars']].mean().sort_values(by = 'rating', ascending = False)

| ride | rating | ta_stars |
| --- | --- | --- |
| Avatar Flight of Passage | 4.790274 | 5.0 |
| Expedition Everest | 4.787115 | 5.0 |
| Rock 'n' Roller Coaster | 4.749077 | 5.0 |
| Toy Story Midway Mania | 4.748408 | 4.5 |
| The Twilight Zone Tower of Terror | 4.748031 | 5.0 |
| Soarin' | 4.734982 | 4.5 |
| Kilimanjaro Safaris | 4.725806 | 4.5 |
| Splash Mountain | 4.705085 | 4.5 |
| Big Thunder Mountain Railroad | 4.650909 | 4.5 |
| Star Tours | 4.542857 | 4.5 |


In [341]:
ride_reviews.groupby('ride')[['rating' , 'ta_stars']].median().sort_values(by = 'rating', ascending = False)

| ride | rating | ta_stars |
| --- | --- | --- |
| Star Tours | 5.0 | 4.5 |
| Haunted Mansion | 5.0 | 4.5 |
| Test Track | 5.0 | 4.5 |
| The Twilight Zone Tower of Terror | 5.0 | 5.0 |
| Splash Mountain | 5.0 | 4.5 |
| Kilimanjaro Safaris | 5.0 | 4.5 |
| Rock 'n' Roller Coaster | 5.0 | 5.0 |
| Toy Story Midway Mania | 5.0 | 4.5 |
| Pirates of the Caribbean | 5.0 | 4.5 |
| Seven Dwarfs Mine Train | 5.0 | 4.5 |


In [344]:
ride_reviews.groupby('ride')[['rating']].min().sort_values(by = 'rating', ascending = False)

| ride | rating |
| --- | --- |
| Prince Charming Regal Carrousel | 3.0 |
| Dumbo the Flying Elephant | 3.0 |
| Main Street Vehicles | 3.0 |
| TriceraTop Spin | 2.0 |
| Mad Tea Party | 2.0 |
| The Barnstormer | 1.0 |
| Slinky Dog Dash | 1.0 |
| Soarin' | 1.0 |
| Space Mountain | 1.0 |
| Spaceship Earth | 1.0 |


In [345]:
ride_reviews.groupby('ride')[['rating']].max().sort_values(by = 'rating', ascending = False)

| ride | rating |
| --- | --- |
| Astro Orbiter | 5.0 |
| Prince Charming Regal Carrousel | 5.0 |
| Seven Dwarfs Mine Train | 5.0 |
| Slinky Dog Dash | 5.0 |
| Soarin' | 5.0 |
| Space Mountain | 5.0 |
| Spaceship Earth | 5.0 |
| Splash Mountain | 5.0 |
| Star Tours | 5.0 |
| Test Track | 5.0 |


In [None]:
ride_stats = ride_reviews.groupby('ride')[['rating']].agg([np.min, np.max, np.mean, np.median])
ride_stats

In [None]:
park_location_stats = ride_reviews.groupby('park_location')[['rating']].agg([np.min, np.max, np.mean, np.median])
park_location_stats

In [None]:
park_area_stats = ride_reviews.groupby('park_area')[['rating']].agg([np.min, np.max, np.mean, np.median])
park_area_stats
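
The summary cells above pass NumPy functions to `agg`. A minimal sketch of the same aggregation with string names follows, on a hypothetical toy frame (`reviews_demo` and its values are made up for illustration). It assumes `rating` is still numeric at this point; newer pandas versions may warn when given callables such as `np.min`, which the string form avoids.

```python
import pandas as pd

# Hypothetical toy reviews; 'rating' is kept as float so mean/median are defined.
reviews_demo = pd.DataFrame({
    "ride": ["A", "A", "B", "B", "B"],
    "rating": [5.0, 3.0, 4.0, 4.0, 1.0],
})

# One row per ride, one column per statistic.
ride_stats_demo = reviews_demo.groupby("ride")["rating"].agg(["min", "max", "mean", "median"])
print(ride_stats_demo)
```

Selecting the column with single brackets (`["rating"]`) gives flat column names; the double-bracket form used in the notebook produces a two-level column index instead.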

In [None]:
grid_params = {
    'countvectorizer__min_df': [1,5,10],
    'countvectorizer__ngram_range': [(1,2)],
    'countvectorizer__stop_words': [None, 'english'],
    'multinomialnb__alpha': [0.1, 1, 10, 100]
}

In [None]:
grid_params = {
    'countvectorizer__min_df': [1,5,10,15,20],
    'countvectorizer__ngram_range': [(1,2)],
    'countvectorizer__stop_words': [None, 'english'],
    'multinomialnb__alpha': [0.1, 1, 10, 100]
}

In [None]:
grid_params = {
    'countvectorizer__min_df': [1,3,5],
    'countvectorizer__ngram_range': [(1,2)],
    'countvectorizer__stop_words': [None, 'english'],
    'multinomialnb__alpha': [0.1, 1, 10, 100]
}

In [None]:
{'countvectorizer__min_df': 5,
 'countvectorizer__ngram_range': (1, 2),
 'countvectorizer__stop_words': None,
 'multinomialnb__alpha': 0.1}

In [None]:
{'countvectorizer__lowercase': True,
 'countvectorizer__min_df': 5,
 'countvectorizer__ngram_range': (1, 2),
 'countvectorizer__stop_words': None,
 'multinomialnb__alpha': 0.1}
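
The keys in the grids above follow the naming that `make_pipeline` gives its steps (the lowercased class name, then `__`, then the parameter). The sketch below shows how such a grid could be wired into `GridSearchCV`. It is an assumption about how the search was run, not the notebook's actual cell: the corpus, labels, `cv=3`, and the smaller `min_df` and `alpha` values are invented so the example runs on twelve short documents.

```python
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy corpus: 1 = positive review, 0 = negative review.
texts = [
    "great ride loved it", "loved the ride great fun", "great fun loved it",
    "fun ride great", "loved it great ride", "great fun ride",
    "long line bad ride", "bad ride long wait", "long wait bad line",
    "bad line long wait", "long line bad wait", "bad wait long line",
]
labels = [1] * 6 + [0] * 6

# Step names become 'countvectorizer' and 'multinomialnb'.
pipe = make_pipeline(CountVectorizer(), MultinomialNB())

# Same keys as the notebook's grids, with values shrunk for the toy corpus.
demo_grid = {
    "countvectorizer__min_df": [1, 2],
    "countvectorizer__ngram_range": [(1, 2)],
    "countvectorizer__stop_words": [None, "english"],
    "multinomialnb__alpha": [0.1, 1],
}

search = GridSearchCV(pipe, demo_grid, cv=3)
search.fit(texts, labels)

print(search.best_params_)  # a dict shaped like the ones shown above
prediction = search.predict(["great fun"])
print(prediction)
```

After fitting, `best_params_` holds one value per grid key, which is the shape of the dictionaries recorded in the cells above.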

In [378]:
import pandas as pd
import numpy as np
import scipy as sp

from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

In [519]:
%pip install nltk

Note: you may need to restart the kernel to use updated packages.


Source: https://stackoverflow.com/questions/29523254/python-remove-stop-words-from-pandas-dataframe

In [524]:
import nltk

In [525]:
from nltk.corpus import stopwords

In [526]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Christopher.Doughty\AppData\Roaming\nltk_data
[nltk_data]     ...
[nltk_data]   Unzipping corpora\stopwords.zip.


True

In [527]:
stop = stopwords.words('english')
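
One way the stop list could be used on the n-gram terms is sketched below: drop any term made up entirely of stop words. This is a hedged illustration, not the notebook's method. `stop_demo` is a small hand-written stand-in for `stopwords.words('english')`, and `words_demo` is a hypothetical frame with rounded coefficient values.

```python
import pandas as pd

# Small hypothetical stop list standing in for stopwords.words('english').
stop_demo = {"do", "this", "is", "the", "a"}

# Hypothetical frame of n-gram terms (index) and their coefficients.
words_demo = pd.DataFrame(
    {"coef_nb": [-6.10, -6.07, -6.04, -6.04, -6.01]},
    index=pd.Index(["fast pass", "do", "like", "line", "this is"], name="positive_words"),
)

def has_content_word(term):
    """Return True if at least one token in the term is not a stop word."""
    return any(token not in stop_demo for token in term.split())

# Boolean mask, one entry per index label.
mask = [has_content_word(term) for term in words_demo.index]
kept = words_demo.loc[mask]
print(kept)
```

With the real NLTK list, `stop_demo` would be replaced by `set(stop)`; a set keeps the membership checks fast.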

In [541]:
positive_words_nb.index.name = 'positive_words'

In [543]:
positive_words_nb.head()

| positive_words | 0 | positive_words_no_stops |
| --- | --- | --- |
| fast pass | -6.095845 | |
| do | -6.066806 | |
| like | -6.042183 | |
| line | -6.036792 | |
| this is | -6.013764 | |


Source: https://stackoverflow.com/questions/20461165/how-to-convert-index-of-a-pandas-dataframe-into-a-column

In [548]:
positive_words_nb['positive_words_no_stops'] = positive_words_nb.index
positive_words_nb.head()

| positive_words | 0 | positive_words_no_stops | index1 |
| --- | --- | --- | --- |
| fast pass | -6.095845 | fast pass | fast pass |
| do | -6.066806 | do | do |
| like | -6.042183 | like | like |
| line | -6.036792 | line | line |
| this is | -6.013764 | this is | this is |


In [556]:
positive_words_nb.rename({'0': 'coef_nb'}, axis = 1, inplace=True)
positive_words_nb

AttributeError: 'NoneType' object has no attribute 'rename'
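
The AttributeError above means `positive_words_nb` was `None` when the cell ran. One likely cause (an assumption, since the earlier cell is not shown) is that the result of an `inplace=True` call was assigned back to the variable. The sketch below, on a hypothetical `demo` frame, shows that behaviour, and also that `rename` silently ignores a key that matches no label, such as the string `'0'` when the column label is the integer `0`.

```python
import pandas as pd

# Hypothetical frame built from unnamed data, so its only column label is the integer 0.
demo = pd.DataFrame([[-6.10], [-6.07]], index=["fast pass", "do"])

# With inplace=True the frame is modified and the call itself returns None.
result = demo.rename(columns={0: "coef_nb"}, inplace=True)
print(result)              # None: assigning this back would overwrite the frame
print(list(demo.columns))  # ['coef_nb']

# A string key does not match an integer label, so nothing is renamed and no error is raised.
unchanged = pd.DataFrame([[-6.04]], index=["like"]).rename(columns={"0": "coef_nb"})
print(list(unchanged.columns))  # [0]
```

Using either `inplace=True` without assignment or plain assignment without `inplace` avoids losing the frame.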

In [545]:
positive_words_nb['positive_words_no_stops'] = positive_words_nb['positive_words']

KeyError: 'positive_words'
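
The KeyError above arises because `positive_words` is the name of the index, not a column, so it cannot be looked up with `df['positive_words']`. A minimal sketch on a hypothetical `demo` frame shows the lookup failing and one way to turn the index into a regular column with `reset_index()`.

```python
import pandas as pd

# Hypothetical frame whose index is named 'positive_words'.
demo = pd.DataFrame(
    {"coef_nb": [-6.10, -6.07]},
    index=pd.Index(["fast pass", "do"], name="positive_words"),
)

# The index name is not among the columns, so demo['positive_words'] would raise KeyError.
print("positive_words" in demo.columns)  # False

# reset_index() moves the named index into an ordinary column.
as_column = demo.reset_index()
print(list(as_column.columns))  # ['positive_words', 'coef_nb']
```

Assigning from `demo.index`, as the earlier cell does, is the other common route and keeps the index in place.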

## Sentiment Analysis

With natural language processing, we can build a sentiment analysis model from the review text and then use its results as inputs to our hypothesis test.

Sources:
<br>• https://github.com/christopherdoughty/scrapeadvisor
<br>• https://towardsdatascience.com/scraping-tripadvisor-text-mining-and-sentiment-analysis-for-hotel-reviews-cc4e20aef333

In [572]:
negative_words_nb_2.index.name = 'negative_words'

In [575]:
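import matplotlib.pyplot as plt
# ['0'] raised a KeyError, most likely because the label is the integer 0, not the string '0'.
# Assumption: as in positive_words_nb, the coefficients are the first column, so select by position.
coef_counts = negative_words_nb_2.iloc[:, 0].value_counts().sort_values(ascending=False)
plt.figure(figsize=(15,10))
coef_counts.plot.bar()
plt.xticks(rotation=50)
plt.xlabel("Coefficient value")
plt.ylabel("Number of terms")
plt.show()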

In [576]:
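import matplotlib.pyplot as plt
import seaborn as sns
plt.figure(figsize=(20,8))
# Select the first column by position; ['0'] raised a KeyError (the label is most likely the integer 0).
sns.histplot(negative_words_nb_2.iloc[:, 0], color = 'navy')
plt.xlabel('Coefficient value')
plt.ylabel('Frequency');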