## Project 3 - Reddit Challenge

# Contents:
- [1.0 Data Science Problem:](#Data-Science-Problem:)
- [1.1 Executive Summary:](#Executive-Summary:)
- [1.2 Importing libaries](#Importing-libraries) 
- [1.3 Scraping Reddit using pushshift api (https://github.com/pushshift/api)](#Scraping-Reddit-using-pushshift-api-(https://github.com/pushshift/api))
- [2. Data cleaning: Initial check](#Data-cleaning:-Initial-check)
- [2.1 Outliers](#Outliers)
- [2.2 Create data cleaning function](#Create-data-cleaning-function)
- [3. EDA: View Plots](#EDA:-View-Plots)
- [3.1 EDA: Plot a Heatmap of the Correlation Matrix](#EDA:-Plot-a-Heatmap-of-the-Correlation-Matrix)
- [3.2 EDA: Use seaborn's .pairplot() method to create scatterplots for each of our features on the target 'SalePrice'](#EDA:-Use-seaborn's-.pairplot()-method-to-create-scatterplots-for-each-of-our-features-on-the-target-'SalePrice')
- [4. Modeling: Starting Train/Test split](#Modeling:-Starting-Train/Test-split)
- [4.1 Ridge model train/test](#Ridge-model-train/test)
- [4.2 Ridge with log](#Ridge-with-log)
- [4.3 Lasso model train/test](#Lasso-model-train/test)
- [4.4 Lasso with log](#Lasso-with-log)
- [4.5 Lasso Log Model with all data](#Lasso-Log-Model-with-all-data)
- [4.6 View Lasso Log Residuals from all data](#View-Lasso-Log-Residuals-from-all-data)
- [4.7 Additional Plots](#Additional-Plots)
- [4.8 Lasso model statistics](#Lasso-model-statistics)
- [5. Load Kaggle Test Data](#Load-Kaggle-Test-Data)
- [5.1 Clean test data (match training data)](#Clean-test-data-(match-training-data))
- [5.2 Apply lasso model and generate predictions](#Apply-lasso-model-and-generate-predictions)
- [5.3 Save SalesPrice Kaggle submission to csv](#Save-SalesPrice-Kaggle-submission-to-csv)
- [5.4 Conclusion:](#Conclusion:)








# Data Science Problem:
How can we use 2006-2010 data from residential properties sold in Ames, Iowa to predict future home sales using a regression model?


# Executive Summary:

The Ames Housing Dataset is an exceptionally detailed and robust dataset with over 70 columns of different features relating to houses.
Data set contains information from the Ames Assessor’s Office used in computing assessed values for individual residential properties sold in Ames, IA from 2006 to 2010.

The data has 82 columns which include 23 nominal, 23 ordinal, 14 discrete, and 20 continuous variables (and 2 additional observation identifiers).

# Importing Libraries:

In [2]:
import requests
import re
import time
import pandas as pd
import matplotlib.pyplot as plt 
import seaborn as sns
from sklearn.model_selection import train_test_split, GridSearchCV

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer, ENGLISH_STOP_WORDS
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier, BaggingClassifier
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

%matplotlib inline

# Scraping Reddit using pushshift api (https://github.com/pushshift/api)

In [5]:
# title = []
# subreddit = []
# upvotes = []
# comments = []
# subscribers = []
# created_date = []
# url = []
# id = []

# df = pd.DataFrame(columns = ['title', 'subreddit', 'comments', 'upvotes', 'subscribers', 'created_date', 'url', 'id'])

# after = None

# for a in range(20):
#     print(f"Looping through page {a+1}")
#     if after == None:
#         params = {
#         'subreddit': 'thanosdidnothingwrong',
#         'after': '1531195200',
#         'size': 1000,
#         'score': '>0',
#         'num_comments': '>0'
#         }
#     else:
#         headers = {'User-agent': 'Alex'}
#         params = {
#         'subreddit': 'thanosdidnothingwrong',
#         'after': after,
#         'size': 1000,
#         'score': '>0',
#         'num_comments': '>0'
#         }
#     link = 'https://api.pushshift.io/reddit/search/submission/'
#     res = requests.get(link, params=params, headers=headers)
#     if res.status_code == 200:
#         the_json = res.json()
#         for i in range(len(the_json['data'])):
#                     title.append(the_json['data'][i]['title'])
#                     subreddit.append(the_json['data'][i]['subreddit'])
#                     comments.append(the_json['data'][i]['num_comments'])
#                     upvotes.append(the_json['data'][i]['score'])
#                     subscribers.append(the_json['data'][i]['subreddit_subscribers'])
#                     created_date.append(the_json['data'][i]['created_utc'])
#                     url.append(the_json['data'][i]['full_link'])
#                     id.append(the_json['data'][i]['id'])
#     after = the_json['data'][i]['created_utc']
#     time.sleep(1)

# df['title'] = title
# df['subreddit'] = subreddit
# df['comments'] = comments
# df['upvotes'] = upvotes
# df['subscribers'] = subscribers
# df['created_date'] = created_date
# df['url'] = url
# df['id'] = id

In [6]:
# title = []
# subreddit = []
# upvotes = []
# comments = []
# subscribers = []
# created_date = []
# url = []
# id = []

# df1 = pd.DataFrame(columns = ['title', 'subreddit', 'comments', 'upvotes', 'subscribers', 'created_date', 'url', 'id'])

# after = None

# for a in range(17):
#     print(f"Looping through page {a+1}")
#     if after == None:
#         params = {
#         'subreddit': 'inthesoulstone',
#         'after': '1531195200',
#         'size': 1000,
#         'score': '>0',
#         'num_comments': '>0'
#         }
#     else:
#         headers = {'User-agent': 'Alex'}
#         params = {
#         'subreddit': 'inthesoulstone',
#         'after': after,
#         'size': 1000,
#         'score': '>0',
#         'num_comments': '>0'
#         }
#     link = 'https://api.pushshift.io/reddit/search/submission/'
#     res = requests.get(link, params=params, headers=headers)
#     if res.status_code == 200:
#         the_json = res.json()
#         for i in range(len(the_json['data'])):
#                     title.append(the_json['data'][i]['title'])
#                     subreddit.append(the_json['data'][i]['subreddit'])
#                     comments.append(the_json['data'][i]['num_comments'])
#                     upvotes.append(the_json['data'][i]['score'])
#                     subscribers.append(the_json['data'][i]['subreddit_subscribers'])
#                     created_date.append(the_json['data'][i]['created_utc'])
#                     url.append(the_json['data'][i]['full_link'])
#                     id.append(the_json['data'][i]['id'])
#     after = the_json['data'][i]['created_utc']
#     time.sleep(1)

# df1['title'] = title
# df1['subreddit'] = subreddit
# df1['comments'] = comments
# df1['upvotes'] = upvotes
# df1['subscribers'] = subscribers
# df1['created_date'] = created_date
# df1['url'] = url
# df1['id'] = id

In [7]:
df1

NameError: name 'df1' is not defined

# Combine `df` and `df1`

In [None]:
combined_data = pd.concat([df, df1],ignore_index=True)

In [None]:
combined_data.loc[19999]

In [None]:
combined_data['subreddit'].value_counts()

In [None]:
combined_data

In [None]:
combined_data[combined_data['upvotes'] > 1000]

In [None]:
len(title)

In [None]:
df.duplicated().sum()

In [None]:
df1.duplicated().sum()

In [None]:
df.loc[9999]

In [None]:
df1.loc[9167]

## Save dataframe to csv file

In [11]:
#combined_data.to_csv('./data/combined_data.csv')

## EDA: 
Load csv file

In [3]:
data = pd.read_csv('./data/combined_data.csv')
data.drop('Unnamed: 0',axis=1, inplace=True) #drop first column Unnamed: 0

In [4]:
data.head()

Unnamed: 0,title,subreddit,comments,upvotes,subscribers,created_date,url,id
0,For those who fell... for balance,thanosdidnothingwrong,1,2,638343,1531195214,https://www.reddit.com/r/thanosdidnothingwrong...,8xla5n
1,Join here if banned or spared.,thanosdidnothingwrong,3,1,638334,1531195219,https://www.reddit.com/r/thanosdidnothingwrong...,8xla6c
2,Wow.,thanosdidnothingwrong,4,1,638275,1531195240,https://www.reddit.com/r/thanosdidnothingwrong...,8xlaah
3,I lived?,thanosdidnothingwrong,6,1,638248,1531195248,https://www.reddit.com/r/thanosdidnothingwrong...,8xlac2
4,Test,thanosdidnothingwrong,1,1,638207,1531195265,https://www.reddit.com/r/thanosdidnothingwrong...,8xlafc


In [5]:
data['subreddit'].value_counts(normalize=True)

thanosdidnothingwrong    0.540789
inthesoulstone           0.459211
Name: subreddit, dtype: float64

## Data Cleaning:  
Use get dummies on `subreddit` column
- 1 = thanos did nothing wrong
- 0 = in the soulstone

In [6]:
data.subreddit = pd.get_dummies(data.subreddit,drop_first=True)

In [7]:
data.head()

Unnamed: 0,title,subreddit,comments,upvotes,subscribers,created_date,url,id
0,For those who fell... for balance,1,1,2,638343,1531195214,https://www.reddit.com/r/thanosdidnothingwrong...,8xla5n
1,Join here if banned or spared.,1,3,1,638334,1531195219,https://www.reddit.com/r/thanosdidnothingwrong...,8xla6c
2,Wow.,1,4,1,638275,1531195240,https://www.reddit.com/r/thanosdidnothingwrong...,8xlaah
3,I lived?,1,6,1,638248,1531195248,https://www.reddit.com/r/thanosdidnothingwrong...,8xlac2
4,Test,1,1,1,638207,1531195265,https://www.reddit.com/r/thanosdidnothingwrong...,8xlafc


## Generate Feature and target

In [8]:
X = data['title'] #feature
y = data['subreddit'] #target

# Conduct Train/Test split

In [9]:
X_train, X_test, y_train, y_test = train_test_split(X, 
                                                    y,
                                                    test_size=0.33, 
                                                    random_state =42,
                                                    stratify=y)

In [None]:
def make_nice_conmat(y_test, preds):

    conf_matrix = confusion_matrix(y_test, preds)
    print(f'Accuracy: {accuracy_score(y_test, preds)}')
    return pd.DataFrame(conf_matrix, columns=['Predicted ' + str(i) for i in ng.target_names],\
            index=['Actual ' + str(i) for i in ng.target_names])

# Logistic Regression

# Random Forest
- with `CountVectorizer`
- with `TFIDF Vectorizer`

# use this model

In [113]:
# pipe = Pipeline([
#     ('vect', CountVectorizer()),
#     ('rf', RandomForestClassifier())
# ])

# params = {
#     'vect__ngram_range': [(1, 2)],
#     'vect__stop_words': [None, 'english'],
#     'vect__min_df': [1,2,4],
#     'rf__n_estimators':[50,100,200],
#     'rf__max_depth':[25,50,75]
# }

# gs = GridSearchCV(pipe, params, verbose=2, cv=5, n_jobs=-1)

# gs.fit(X_train, y_train)

# y_preds = gs.predict(X_test)

# print(gs.best_score_)
# print(accuracy_score(y_test, y_preds))
# gs.best_params_

Fitting 5 folds for each of 108 candidates, totalling 540 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  33 tasks      | elapsed:   37.6s
[Parallel(n_jobs=-1)]: Done 154 tasks      | elapsed:  3.9min
[Parallel(n_jobs=-1)]: Done 357 tasks      | elapsed: 15.0min
[Parallel(n_jobs=-1)]: Done 540 out of 540 | elapsed: 33.4min finished


0.6661554604891436
0.6603850880786563


{'rf__max_depth': 75,
 'rf__n_estimators': 200,
 'vect__min_df': 1,
 'vect__ngram_range': (1, 1),
 'vect__stop_words': None}

In [104]:
pipe = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('rf', RandomForestClassifier())
])

params = {
    'tfidf__ngram_range': [(1, 3)],
    'tfidf__stop_words': [None, 'english'],
    'tfidf__min_df': [1,2,4],
    'rf__n_estimators':[50,100,200],
    'rf__max_depth':[25,50,75]
}

gs = GridSearchCV(pipe, params, verbose=2, cv=5, n_jobs=-1)

gs.fit(X_train, y_train)

y_preds = gs.predict(X_test)

print(gs.best_score_)
print(accuracy_score(y_test, y_preds))
gs.best_params_

Fitting 5 folds for each of 36 candidates, totalling 180 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  33 tasks      | elapsed:   55.0s


KeyboardInterrupt: 

# Support Vector Models

In [None]:
pipe = Pipeline([
    ('vect', CountVectorizer()),
    ('svm', SVC())
])

params = {
    'vect__ngram_range': [(1, 3)],
    'vect__stop_words': [None, 'english'],
    'svm__C': [1, 2, 3],
    'svm__kernel': ['rbf', 'linear']
}
gs = GridSearchCV(pipe, params, verbose=2, cv=5)

gs.fit(X_train, y_train)

y_preds = gs.predict(X_test)

print(gs.best_score_)
print(accuracy_score(y_test, y_preds))
gs.best_params_

Fitting 5 folds for each of 12 candidates, totalling 60 fits
[CV] svm__C=1, svm__kernel=rbf, vect__ngram_range=(1, 3), vect__stop_words=None 


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


[CV]  svm__C=1, svm__kernel=rbf, vect__ngram_range=(1, 3), vect__stop_words=None, total= 1.3min
[CV] svm__C=1, svm__kernel=rbf, vect__ngram_range=(1, 3), vect__stop_words=None 


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:  2.1min remaining:    0.0s


[CV]  svm__C=1, svm__kernel=rbf, vect__ngram_range=(1, 3), vect__stop_words=None, total= 1.3min
[CV] svm__C=1, svm__kernel=rbf, vect__ngram_range=(1, 3), vect__stop_words=None 
[CV]  svm__C=1, svm__kernel=rbf, vect__ngram_range=(1, 3), vect__stop_words=None, total= 1.3min
[CV] svm__C=1, svm__kernel=rbf, vect__ngram_range=(1, 3), vect__stop_words=None 
[CV]  svm__C=1, svm__kernel=rbf, vect__ngram_range=(1, 3), vect__stop_words=None, total= 1.3min
[CV] svm__C=1, svm__kernel=rbf, vect__ngram_range=(1, 3), vect__stop_words=None 
[CV]  svm__C=1, svm__kernel=rbf, vect__ngram_range=(1, 3), vect__stop_words=None, total= 1.3min
[CV] svm__C=1, svm__kernel=rbf, vect__ngram_range=(1, 3), vect__stop_words=english 
[CV]  svm__C=1, svm__kernel=rbf, vect__ngram_range=(1, 3), vect__stop_words=english, total=  38.1s
[CV] svm__C=1, svm__kernel=rbf, vect__ngram_range=(1, 3), vect__stop_words=english 
[CV]  svm__C=1, svm__kernel=rbf, vect__ngram_range=(1, 3), vect__stop_words=english, total=  37.6s
[CV] sv

[CV]  svm__C=3, svm__kernel=rbf, vect__ngram_range=(1, 3), vect__stop_words=english, total=  36.8s
[CV] svm__C=3, svm__kernel=rbf, vect__ngram_range=(1, 3), vect__stop_words=english 
[CV]  svm__C=3, svm__kernel=rbf, vect__ngram_range=(1, 3), vect__stop_words=english, total=  43.3s
[CV] svm__C=3, svm__kernel=rbf, vect__ngram_range=(1, 3), vect__stop_words=english 
[CV]  svm__C=3, svm__kernel=rbf, vect__ngram_range=(1, 3), vect__stop_words=english, total=  39.4s
[CV] svm__C=3, svm__kernel=rbf, vect__ngram_range=(1, 3), vect__stop_words=english 
[CV]  svm__C=3, svm__kernel=rbf, vect__ngram_range=(1, 3), vect__stop_words=english, total=  36.6s
[CV] svm__C=3, svm__kernel=linear, vect__ngram_range=(1, 3), vect__stop_words=None 
[CV]  svm__C=3, svm__kernel=linear, vect__ngram_range=(1, 3), vect__stop_words=None, total= 4.6min
[CV] svm__C=3, svm__kernel=linear, vect__ngram_range=(1, 3), vect__stop_words=None 
[CV]  svm__C=3, svm__kernel=linear, vect__ngram_range=(1, 3), vect__stop_words=None, 

In [None]:
pipe = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('svm', SVC())
])

params = {
    'tfidf__ngram_range': [(1, 3)],
    'tfidf__stop_words': [None, 'english'],
    'svm__C': [1, 2, 3, 4],
    'svm__kernel': ['rbf', 'poly', 'linear']
}
gs = GridSearchCV(pipe, params, verbose=2, cv=5,n_jobs=2)

gs.fit(X_train, y_train)

y_preds = gs.predict(X_test)

print(gs.best_score_)
print(accuracy_score(y_test, y_preds))
gs.best_params_

In [13]:
log_tfidf_pipe = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('log_reg', LogisticRegression())
])

log_tfidf_params = {
    'tfidf__ngram_range': [(1,3)],
    'tfidf__stop_words': [None, 'english'],
    'log_reg__penalty': ['l1', 'l2'],
    'log_reg__C': [1,10,100]
}

log_tfidf_gs = GridSearchCV(log_tfidf_pipe, log_tfidf_params, verbose=2, cv=5, n_jobs=-1)

log_tfidf_gs.fit(X_train, y_train)

y_preds = log_tfidf_gs.predict(X_test)

print(log_tfidf_gs.best_score_)
print(accuracy_score(y_test, y_preds))
log_tfidf_gs.best_params_

Fitting 5 folds for each of 12 candidates, totalling 60 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.


KeyboardInterrupt: 

In [12]:
log_vect_pipe = Pipeline([
    ('vect', CountVectorizer()),
    ('log_reg', LogisticRegression())
])

log_vect_params = {
    'vect__ngram_range': [(1, 3)],
    'vect__stop_words': [None, 'english'],
    'log_reg__penalty': ['l1', 'l2'],
    'log_reg__C': [1,10,100]
}

log_vect_gs = GridSearchCV(log_vect_pipe, log_vect_params, verbose=2, cv=5, n_jobs=2)

log_vect_gs.fit(X_train, y_train)

y_preds = log_vect_gs.predict(X_test)

print(log_vect_gs.best_score_)
print(accuracy_score(y_test, y_preds))
log_vect_gs.best_params_

Fitting 5 folds for each of 12 candidates, totalling 60 fits


[Parallel(n_jobs=2)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=2)]: Done  37 tasks      | elapsed:  1.4min
[Parallel(n_jobs=2)]: Done  60 out of  60 | elapsed:  3.0min finished


0.657760916942449
0.6578451454322


{'log_reg__C': 1,
 'log_reg__penalty': 'l2',
 'vect__ngram_range': (1, 3),
 'vect__stop_words': None}

# Boost

In [86]:
# pipe = Pipeline([
#     ('vect', CountVectorizer()),
#     ('ada', AdaBoostClassifier())
# ])

# params = {
#     'vect__ngram_range': [(1, 1), (1, 2)],
#     'vect__stop_words': [None, 'english'],
#     'ada__base_estimator': [DecisionTreeClassifier(max_depth=1),DecisionTreeClassifier(max_depth=2),DecisionTreeClassifier(max_depth=3)],
#     'ada__n_estimators': [50, 100]
# }

# gs = GridSearchCV(pipe, params, verbose=2, cv=5, n_jobs=-1)

# gs.fit(X_train, y_train)

# y_preds = gs.predict(X_test)

# print(gs.best_score_)
# print(accuracy_score(y_test, y_preds))
# gs.best_params_

Fitting 5 folds for each of 24 candidates, totalling 120 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  33 tasks      | elapsed:   49.9s
[Parallel(n_jobs=-1)]: Done 120 out of 120 | elapsed:  4.0min finished


0.6548147550246186
0.6551413355182303


{'ada__base_estimator': DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=2,
             max_features=None, max_leaf_nodes=None,
             min_impurity_decrease=0.0, min_impurity_split=None,
             min_samples_leaf=1, min_samples_split=2,
             min_weight_fraction_leaf=0.0, presort=False, random_state=None,
             splitter='best'),
 'ada__n_estimators': 100,
 'vect__ngram_range': (1, 2),
 'vect__stop_words': None}

In [88]:
pipe = Pipeline([
    ('vect', CountVectorizer()),
    ('gb', GradientBoostingClassifier())
])

params = {
    'vect__ngram_range': [(1, 2)],
    'vect__stop_words': [None, 'english'],
    'gb__loss': ['deviance', 'exponential'],
    'gb__n_estimators': [50, 100],
    'gb__max_depth': [3, 4]
}

gs = GridSearchCV(pipe, params, verbose=2, cv=5, n_jobs=-1)

gs.fit(X_train, y_train)

y_preds = gs.predict(X_test)

print(gs.best_score_)
print(accuracy_score(y_test, y_preds))
gs.best_params_

Fitting 5 folds for each of 32 candidates, totalling 160 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  33 tasks      | elapsed:  1.3min
[Parallel(n_jobs=-1)]: Done 160 out of 160 | elapsed:  7.2min finished


0.6500524658971668
0.6498156493240476


{'gb__loss': 'exponential',
 'gb__max_depth': 4,
 'gb__n_estimators': 100,
 'vect__ngram_range': (1, 2),
 'vect__stop_words': None}

# Bagging Classifier

In [139]:
# bag_pipe = Pipeline([
#     ('vect', CountVectorizer()),
#     ('bag', BaggingClassifier())
# ])

# bag_params = {
#     'vect__ngram_range': [(1, 2)],
#     'vect__stop_words': [None, 'english'],
#     'vect__strip_accents': ['ascii', 'unicode'],
#     'bag__base_estimator': [DecisionTreeClassifier(max_depth=2), DecisionTreeClassifier(max_depth=3), DecisionTreeClassifier(max_depth=4)],
#     'bag__n_estimators': [10, 50, 100],
# }

# bag_gs = GridSearchCV(pipe, params, verbose=2, cv=5, n_jobs=-1)

# bag_gs.fit(X_train, y_train)

# y_preds = bag_gs.predict(X_test)

# print(bag_gs.best_score_)
# print(accuracy_score(y_test, y_preds))
# bag_gs.best_params_

Fitting 5 folds for each of 72 candidates, totalling 360 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  33 tasks      | elapsed:   42.7s
[Parallel(n_jobs=-1)]: Done 154 tasks      | elapsed: 22.2min
[Parallel(n_jobs=-1)]: Done 360 out of 360 | elapsed: 31.8min finished


0.6634110904834934
0.6651372388365424


{'log_reg__C': 1,
 'log_reg__penalty': 'l2',
 'tfidf__min_df': 1,
 'tfidf__ngram_range': (1, 3),
 'tfidf__stop_words': None,
 'tfidf__strip_accents': 'unicode'}

In [None]:
bag_pipe = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('bag', BaggingClassifier())
])

bag_params = {
    'tfidf__ngram_range': [(1, 2)],
    'tfidf__stop_words': [None, 'english'],
    'tfidf__strip_accents': ['ascii', 'unicode'],
    'bag__base_estimator': [DecisionTreeClassifier(max_depth=2), DecisionTreeClassifier(max_depth=3), DecisionTreeClassifier(max_depth=4)],
    'bag__n_estimators': [10, 50, 100],
}

bag_gs = GridSearchCV(pipe, params, verbose=2, cv=5, n_jobs=-1)

bag_gs.fit(X_train, y_train)

y_preds = bag_gs.predict(X_test)

print(bag_gs.best_score_)
print(accuracy_score(y_test, y_preds))
bag_gs.best_params_

In [17]:
coef_df = pd.DataFrame(bag.coef_, columns=cv1.get_feature_names()).T
coef_df['Absolute'] = coef_df[0].abs()
coef_df

NameError: name 'bag' is not defined

In [None]:
nb = MultinomialNB()

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB

bayes.pipe = Pipeline([
    ('normalizer', TextNormalizer()),
    ('vectorizer', GensimVectorizer()),
    ('bayes', MultinomialNB()),
])

In [None]:
model.set_params(onehot__threshold=3.0)


In [None]:
from sklearn.model_selection import GridSearchCV

search = GridSearchCV(model, param_grid={
    'count__analyzer': ['word', 'char', 'char_wb'],
    'count__ngram_range': [(1,1), (1,2), (1,3), (1,4), (1,5), (2,3)],
    'onehot__threshold': [0.0, 1.0, 2.0, 3.0],
    'bayes__alpha': [0.0, 1.0],
})

In [None]:
vect = CountVectorizer()

# this normalizes each term frequency by the 
# number of documents having that term
tfidf = TfidfVectorizer()
log_reg = LogisticRegression()

pipeline = Pipeline([
    ('vect',vect),
    ('tfidf',tfidf),
    ('log_reg')
])

# call fit as you would on any classifier
pipeline.fit(X_train,y_train)

# predict test instances
y_preds = pipeline.predict(X_test)

# calculate f1
mean_f1 = f1_score(y_test, y_preds, average='micro')

In [None]:
param_grid = {
    'vect__max_df':[0.8,0.9,1.0],
    'clf__C':[0.1,1.0]
}

# do 3-fold cross validation for each of the 6 possible
# combinations of the parameter values above
grid = GridSearchCV(pipeline, cv=3, param_grid=param_grid)
grid.fit(X_train,y_train)
