# Now it's your turn!

Use the following dataset of scraped "Data Scientist" and "Data Analyst" job listings to create your own Document Classification Models.

<https://raw.githubusercontent.com/LambdaSchool/DS-Unit-4-Sprint-2-NLP/master/module3-Document-Classification/job_listings.csv>

Requirements:

- Apply both CountVectorizer and TfidfVectorizer methods to this data and compare results
- Use at least two different classification models to compare differences in model accuracy
- Try to "Hyperparameter Tune" your model by using different `ngram_range` and `max_features` settings and different data cleaning methods
- Try to get the highest accuracy possible!

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

import re
import string
from bs4 import BeautifulSoup
from urllib.request import urlopen

import nltk
nltk.download('stopwords')
from nltk.tokenize import sent_tokenize # Sentence Tokenizer
from nltk.tokenize import word_tokenize # Word Tokenizer
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.probability import FreqDist

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Gutierrez\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [2]:
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)

In [3]:
postings = []

url = "https://raw.githubusercontent.com/LambdaSchool/DS-Unit-4-Sprint-2-NLP/master/module3-Document-Classification/job_listings.csv"

data = urlopen(url)
for line in data:
    html_doc = line
    soup = BeautifulSoup(html_doc, 'html.parser')
    postings.append(soup.get_text())

In [4]:
# Drop the first item in the list (the CSV header row)
postings.pop(0)

'description,title,job\n'

In [5]:
post_dict = {'description': [], 'title': [], 'job': []}

for posting in postings:
    # Splitting at quote marks and comma
    #posting = re.split(r'",|\',', posting)[0]
    posting = posting.strip('b\"\'')
    posting = posting.rstrip('\n')
    # Converting `\\n` into space and joining
    posting = (' ').join(posting.split('\\n'))
    # Converting `/` into spaces
    posting = (' ').join(posting.split('/'))
    posting = posting.split(',')
    post_dict['description'].append((' ').join(posting[:-2]))
    post_dict['title'].append(posting[-2])
    post_dict['job'].append(posting[-1])
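
Splitting each line on every comma works here only because the description pieces are re-joined afterwards, and the commas inside the description are lost along the way. As a hedged alternative sketch, Python's `csv` module parses quoted fields, so embedded commas survive. This assumes the file quotes any field that contains commas, as standard CSV writers do; the two sample rows below are invented for illustration and are not taken from the dataset.

```python
import csv
import io

# Hypothetical two-row sample in the same column layout (description,title,job).
sample = (
    'description,title,job\n'
    '"Build models, dashboards, and reports",Data Scientist I,Data Scientist\n'
    '"Clean data, write SQL",Junior Analyst,Data Analyst\n'
)

# csv.DictReader honours the quotes, so commas inside a description are kept.
rows = list(csv.DictReader(io.StringIO(sample)))

for row in rows:
    print(row['job'], '|', row['title'], '|', row['description'])
```

`pd.read_csv` applies the same quoting rules, so it may be possible to load the file in one call instead of parsing it line by line.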

In [6]:
df = pd.DataFrame(post_dict)

In [7]:
df.job.value_counts()

Data Scientist    250
Data Analyst      250
Name: job, dtype: int64

In [8]:
df['job'] = df.job.map({'Data Analyst': 0, 'Data Scientist': 1})

In [9]:
df.head()

Unnamed: 0,description,title,job
0,Job Requirements: Conceptual understanding in ...,Data scientistÂ,1
1,Job Description As a Data Scientist 1 you wi...,Data Scientist I,1
2,As a Data Scientist you will be working on con...,Data Scientist - Entry Level,1
3,$4 969 - $6 756 a monthContractUnder the gener...,Data Scientist,1
4,Location: USA \xe2\x80\x93 multiple locations ...,Data Scientist,1


In [10]:
from sklearn.model_selection import train_test_split

# CountVectorizer works on a single text column,
# so we merge the `description` and `title` columns.
# Note: `+` concatenates the two strings with no separator between them.
X = df.description + df.title
y = df.job

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [11]:
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((400,), (100,), (400,), (100,))

In [12]:
X_train.values[0]

'Company Overview  Digital Assets Data is a leader in helping sophisticated institutional investors understand the cryptoasset markets through unique data sets and insightful analysis. A fast-growing startup with over $5m in seed funding  Digital Assets Data has already rolled out software to some of the top crypto funds.  Data Scientist Role  Digital Assets Data is looking for a Data Scientist to join our analytics team based in Denver  CO. As a Data Scientist  you\\xe2\\x80\\x99ll be a part of a high performing team that is leading the disruption of the crypto investment industry. We are looking for someone who is motivated and passionate about designing novel trading indicators  indices  and analyses in the emerging field of cryptoanalysis as well as producing thought leadership that will help shape how investors think about and value the space. With extremely large datasets consisting of billions of records from exchanges and blockchains  you will be utilizing data that very few in

## CountVectorizer

### First Round - Using all Words/Features

Open question: should the text be cleaned more thoroughly before this step? The vocabulary printed below contains encoding debris such as `xe2`, `x80`, and `x99ll`.
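
Tokens like `xe2`, `x80`, and `x99ll` in the vocabulary come from escaped UTF-8 bytes (for example `\xe2\x80\x99`) that were left in the text as literal backslash sequences. A minimal cleaning sketch, assuming that is the only kind of debris to handle: `clean_posting` is a hypothetical helper, not part of the original notebook, and it only decodes `\xNN` runs and collapses whitespace.

```python
import re

# One or more consecutive literal "\xNN" escapes, e.g. \xe2\x80\x99
_ESCAPED_BYTES = re.compile(r'(?:\\x[0-9a-fA-F]{2})+')

def _decode_run(match):
    # Turn "\xe2\x80\x99" into the bytes e2 80 99, then decode them as UTF-8.
    hex_digits = match.group(0).replace('\\x', '')
    return bytes.fromhex(hex_digits).decode('utf-8', errors='ignore')

def clean_posting(text):
    text = _ESCAPED_BYTES.sub(_decode_run, text)  # escaped bytes -> real characters
    text = text.replace('\\n', ' ')               # literal backslash-n -> space
    text = re.sub(r'\s+', ' ', text)              # collapse runs of whitespace
    return text.strip()

print(clean_posting(r"you\xe2\x80\x99ll  work\nhere"))
```

Applying such a function to the text before vectorizing would keep these byte fragments out of the feature set.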

In [13]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(max_features=None, ngram_range=(1,1), stop_words='english')
vectorizer.fit(X_train)

print(vectorizer.vocabulary_)

{'company': 1691, 'overview': 5611, 'digital': 2425, 'assets': 803, 'data': 2167, 'leader': 4588, 'helping': 3784, 'sophisticated': 7272, 'institutional': 4202, 'investors': 4353, 'understand': 8118, 'cryptoasset': 2093, 'markets': 4883, 'unique': 8138, 'sets': 7058, 'insightful': 4180, 'analysis': 623, 'fast': 3198, 'growing': 3674, 'startup': 7429, '5m': 209, 'seed': 6985, 'funding': 3462, 'rolled': 6787, 'software': 7245, 'crypto': 2091, 'funds': 3464, 'scientist': 6922, 'role': 6784, 'looking': 4759, 'join': 4426, 'analytics': 633, 'team': 7742, 'based': 985, 'denver': 2300, 'xe2': 8672, 'x80': 8562, 'x99ll': 8601, 'high': 3797, 'performing': 5750, 'leading': 4591, 'disruption': 2521, 'investment': 4351, 'industry': 4088, 'motivated': 5174, 'passionate': 5689, 'designing': 2344, 'novel': 5369, 'trading': 7945, 'indicators': 4076, 'indices': 4077, 'analyses': 621, 'emerging': 2795, 'field': 3237, 'cryptoanalysis': 2092, 'producing': 6127, 'thought': 7854, 'leadership': 4590, 'help':

In [14]:
train_word_counts  = vectorizer.transform(X_train)
X_train_vectorized = pd.DataFrame(train_word_counts.toarray(),
                                 # get_feature_names() was removed in scikit-learn 1.2
                                 columns=vectorizer.get_feature_names_out())

print(X_train_vectorized.shape)
X_train_vectorized.head()

(400, 8723)


[Output preview: first 5 rows of the 400 × 8723 word-count matrix. Columns run from numeric tokens (`00`, `000`, `00011236`, ...) to `zywave` and include encoding-debris tokens such as `x80`, `x99ll`, and `xe2`; nearly every entry is 0.]
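
Converting to a dense DataFrame is handy for inspection, but scikit-learn estimators also accept the sparse matrix returned by the vectorizer directly, which avoids materialising thousands of mostly-zero columns. A small sketch with an invented four-document corpus (`docs` and `labels` are illustrative stand-ins, not the job listings):

```python
from scipy.sparse import issparse
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Made-up documents: 1 = scientist-style text, 0 = analyst-style text.
docs = [
    'build machine learning models in python',
    'create sql reports and dashboards',
    'train deep learning models',
    'maintain excel reports for finance',
]
labels = [1, 0, 1, 0]

vec = CountVectorizer(stop_words='english')
counts = vec.fit_transform(docs)  # scipy sparse matrix, no .toarray() needed

# The classifier is fitted on the sparse matrix as-is.
model = LogisticRegression(solver='lbfgs').fit(counts, labels)

print(issparse(counts), counts.shape)
print(model.predict(vec.transform(['python machine learning'])))
```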


In [15]:
test_word_counts = vectorizer.transform(X_test)
X_test_vectorized = pd.DataFrame(test_word_counts.toarray(),
                                columns=vectorizer.get_feature_names_out())

print(X_test_vectorized.shape)
X_test_vectorized.head()

(100, 8723)


[Output preview: first 5 rows of the 100 × 8723 test-set word-count matrix, with the same columns as the training matrix (`00`, `000`, ... through `zywave`); nearly every entry is 0.]


In [16]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

LR = LogisticRegression(solver='lbfgs', random_state=42).fit(X_train_vectorized, y_train)

train_predictions = LR.predict(X_train_vectorized)
test_predictions = LR.predict(X_test_vectorized)

print(f'Train ROC_AUC Score: {roc_auc_score(y_train, train_predictions)}')
print(f'Test ROC_AUC Score: {roc_auc_score(y_test, test_predictions)}')

Train ROC_AUC Score: 0.992547018807523
Test ROC_AUC Score: 0.9613526570048309
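
The scores above pass hard 0/1 predictions to `roc_auc_score`. With binary predictions the metric reduces to balanced accuracy; it is more commonly computed from continuous scores such as `predict_proba`. A minimal sketch on synthetic data, where `make_classification` stands in for the vectorized postings (an assumption made purely for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the vectorized job postings.
X_demo, y_demo = make_classification(n_samples=500, n_features=20, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

clf = LogisticRegression(solver='lbfgs', max_iter=1000, random_state=42).fit(X_tr, y_tr)

# AUC from hard labels versus AUC from predicted probabilities of class 1.
auc_from_labels = roc_auc_score(y_te, clf.predict(X_te))
auc_from_scores = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])
bal_acc = balanced_accuracy_score(y_te, clf.predict(X_te))

print(f'AUC from hard labels:   {auc_from_labels:.4f}')
print(f'Balanced accuracy:      {bal_acc:.4f}')
print(f'AUC from probabilities: {auc_from_scores:.4f}')
```

The first two numbers coincide, while the third reflects how well the model ranks examples across all thresholds.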




The Random Forest Classifier below scores slightly lower than logistic regression on the test set (ROC AUC 0.952 vs. 0.961). We'll try it again further down with some tweaks.

In [17]:
from sklearn.ensemble import RandomForestClassifier

RFC = RandomForestClassifier(n_estimators=100).fit(X_train_vectorized, y_train)

rfc_train_predictions = RFC.predict(X_train_vectorized)
rfc_test_predictions = RFC.predict(X_test_vectorized)

print(f'Train ROC_AUC Score: {roc_auc_score(y_train, rfc_train_predictions)}')
print(f'Test ROC_AUC Score: {roc_auc_score(y_test, rfc_test_predictions)}')

Train ROC_AUC Score: 0.9923469387755103
Test ROC_AUC Score: 0.9520933977455717


### Second Round - Using Limited Number of Features/Words

In [18]:
vectorizer2 = CountVectorizer(max_features=80, ngram_range=(1,2), stop_words='english')

vectorizer2.fit(X_train)

print(vectorizer2.vocabulary_)

{'company': 8, 'data': 11, 'analysis': 2, 'scientist': 54, 'role': 52, 'analytics': 5, 'team': 64, 'xe2': 77, 'x80': 74, 'help': 22, 'technology': 67, 'work': 71, 'x99s': 76, 'responsibilities': 51, 'python': 42, 'develop': 17, 'models': 32, 'technical': 66, 'development': 18, 'quality': 44, 'design': 16, 'science': 53, 'advanced': 1, 'new': 33, 'statistical': 58, 'sql': 57, 'research': 50, 'required': 48, 'qualifications': 43, 'degree': 15, 'computer': 9, 'statistics': 59, 'years': 79, 'experience': 21, 'analytical': 4, 'knowledge': 27, 'business': 7, 'time': 68, 'machine': 29, 'learning': 28, 'including': 23, 'engineering': 19, 'working': 72, 'ability': 0, 'preferred': 37, 'using': 70, 'tools': 69, 'data scientist': 14, 'xe2 x80': 78, 'x80 x99s': 75, 'data science': 13, 'machine learning': 30, 'position': 36, 'analyst': 3, 'job': 26, 'support': 62, 'systems': 63, 'solutions': 56, 'management': 31, 'insights': 25, 'problems': 38, 'reporting': 46, 'related': 45, 'skills': 55, 'customer

In [19]:
train_word_counts2 = vectorizer2.transform(X_train)
X_train_vectorized2 = pd.DataFrame(train_word_counts2.toarray(),
                                  columns=vectorizer2.get_feature_names_out())

print(X_train_vectorized2.shape)
X_train_vectorized2.head()

(400, 80)


Unnamed: 0,ability,advanced,analysis,analyst,analytical,analytics,build,business,company,computer,customer,data,data analyst,data science,data scientist,degree,design,develop,development,engineering,environment,experience,help,including,information,insights,job,knowledge,learning,machine,machine learning,management,models,new,opportunity,people,position,preferred,problems,product,projects,provide,python,qualifications,quality,related,reporting,reports,required,requirements,research,responsibilities,role,science,scientist,skills,solutions,sql,statistical,statistics,status,strong,support,systems,team,teams,technical,technology,time,tools,using,work,working,world,x80,x80 x99s,x99s,xe2,xe2 x80,years
0,1,1,2,0,1,3,0,1,1,1,0,20,0,2,4,1,1,2,1,1,0,8,2,3,0,0,0,1,2,2,2,0,2,1,0,0,0,1,0,0,0,0,2,2,1,0,0,0,2,0,1,1,1,3,4,0,0,2,1,2,0,0,0,0,6,0,2,3,2,1,1,2,1,0,4,1,1,4,4,2
1,4,0,3,3,0,0,0,1,0,0,1,15,3,0,0,1,0,1,1,0,0,5,0,0,0,1,3,1,0,0,0,2,2,0,1,0,3,0,1,0,0,0,0,1,0,2,1,0,2,1,0,2,0,1,0,5,1,0,0,2,0,1,6,2,1,0,3,0,1,2,2,4,2,0,3,2,2,3,3,1
2,0,0,0,2,0,0,0,0,2,1,0,5,2,0,2,1,0,0,1,0,1,6,0,0,0,0,0,0,1,1,1,0,0,0,0,0,0,2,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,1,2,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,1
3,0,0,1,2,0,1,0,1,0,0,0,4,2,0,0,0,0,0,0,0,1,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0
4,8,1,8,7,2,1,3,10,0,1,0,17,6,0,0,1,0,3,4,0,4,6,4,3,9,4,1,4,0,0,0,5,0,1,0,0,2,2,2,0,5,2,0,0,1,0,6,1,1,1,0,2,3,1,0,3,2,0,1,0,0,2,2,1,2,2,0,2,0,4,3,9,3,8,3,3,3,3,3,3


In [20]:
test_word_counts2 = vectorizer2.transform(X_test)
X_test_vectorized2 = pd.DataFrame(test_word_counts2.toarray(),
                                 columns=vectorizer2.get_feature_names_out())

print(X_test_vectorized2.shape)
X_test_vectorized2.head()

(100, 80)


Unnamed: 0,ability,advanced,analysis,analyst,analytical,analytics,build,business,company,computer,customer,data,data analyst,data science,data scientist,degree,design,develop,development,engineering,environment,experience,help,including,information,insights,job,knowledge,learning,machine,machine learning,management,models,new,opportunity,people,position,preferred,problems,product,projects,provide,python,qualifications,quality,related,reporting,reports,required,requirements,research,responsibilities,role,science,scientist,skills,solutions,sql,statistical,statistics,status,strong,support,systems,team,teams,technical,technology,time,tools,using,work,working,world,x80,x80 x99s,x99s,xe2,xe2 x80,years
0,2,1,0,1,2,0,0,0,1,0,2,12,1,0,0,0,0,0,0,0,0,5,0,3,2,0,1,0,0,0,0,4,0,1,2,0,0,1,0,0,1,1,0,2,1,2,0,1,2,0,2,1,0,0,0,4,0,0,1,0,4,2,0,0,1,1,0,0,1,0,0,4,2,1,0,0,0,0,0,3
1,0,0,1,0,0,0,0,1,2,0,0,7,0,1,3,0,0,0,0,0,0,0,1,0,0,2,0,0,0,0,0,0,0,1,0,0,0,0,0,1,1,4,0,0,1,0,0,0,0,1,0,0,1,1,3,0,0,0,0,0,0,0,0,1,4,1,0,0,0,1,0,0,1,0,0,0,0,0,0,1
2,2,0,0,2,1,0,0,3,1,0,0,9,2,0,0,0,0,0,1,0,1,0,0,2,0,0,1,1,0,0,0,1,0,0,1,1,0,1,0,0,3,0,0,0,1,1,0,0,2,1,2,1,0,0,0,3,0,0,0,2,0,1,1,0,3,1,0,0,2,0,0,2,1,0,3,2,2,3,3,0
3,2,1,4,0,2,3,2,5,1,1,0,15,0,1,1,1,1,0,0,2,0,4,0,0,0,4,0,0,1,1,1,0,0,1,1,2,0,1,2,5,1,1,1,0,0,0,0,0,0,1,1,1,2,3,1,5,0,1,2,0,3,3,0,0,2,1,2,3,0,1,0,6,1,1,1,1,1,1,1,2
4,0,0,3,0,0,0,0,1,2,1,2,13,0,1,2,1,0,0,1,0,0,4,2,0,3,0,0,0,0,0,0,0,0,0,0,1,0,1,0,1,1,0,0,1,3,1,0,1,0,0,0,1,0,2,2,4,0,1,2,2,0,0,0,3,0,0,0,0,0,1,2,1,2,0,1,1,1,1,1,0


In [21]:
RFC2 = RandomForestClassifier(n_estimators=100).fit(X_train_vectorized2, y_train)

train_predictions2 = RFC2.predict(X_train_vectorized2)
test_predictions2 = RFC2.predict(X_test_vectorized2)

print(f'Train ROC_AUC Score: {roc_auc_score(y_train, train_predictions2)}')
print(f'Test ROC_AUC Score: {roc_auc_score(y_test, test_predictions2)}')

Train ROC_AUC Score: 0.9924469787915166
Test ROC_AUC Score: 0.9814814814814815
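
The two rounds above change `max_features` and `ngram_range` by hand. One way to search such settings systematically is a `Pipeline` wrapped in `GridSearchCV`. The sketch below uses a tiny invented corpus and an arbitrary grid, so the values it selects say nothing about the job-listings data.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Made-up documents: 1 = scientist-style text, 0 = analyst-style text.
docs = [
    'build machine learning models', 'train deep learning models',
    'python statistics research', 'machine learning research python',
    'sql reports dashboards', 'excel reports finance',
    'sql reporting business', 'excel dashboards business',
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Vectorizer and classifier are tuned together as one estimator.
pipe = Pipeline([
    ('vect', CountVectorizer(stop_words='english')),
    ('clf', RandomForestClassifier(n_estimators=20, random_state=42)),
])

# Step name + double underscore + parameter name addresses a pipeline step.
param_grid = {
    'vect__ngram_range': [(1, 1), (1, 2)],
    'vect__max_features': [10, None],
}

search = GridSearchCV(pipe, param_grid, cv=2, scoring='roc_auc')
search.fit(docs, labels)

print(search.best_params_)
print(search.best_score_)
```

Because the vectorizer sits inside the pipeline, it is refitted on each training fold, so the held-out fold never influences the vocabulary.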


## TF-IDF Vectorizer
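
TF-IDF rescales raw counts so that terms appearing in most documents are down-weighted relative to rarer ones. A toy sketch comparing the two vectorizers on the same text (the three sentences are invented for illustration):

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = [
    'data analyst data reports',
    'data scientist models',
    'data engineer pipelines',
]

count_vec = CountVectorizer()
tfidf_vec = TfidfVectorizer()
counts = count_vec.fit_transform(docs).toarray()
weights = tfidf_vec.fit_transform(docs).toarray()

c_idx = count_vec.vocabulary_
t_idx = tfidf_vec.vocabulary_

# In the second document 'data' and 'models' each occur once...
print(counts[1, c_idx['data']], counts[1, c_idx['models']])
# ...but TF-IDF gives 'models', which appears in only one document, the larger weight.
print(weights[1, t_idx['data']], weights[1, t_idx['models']])
```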

In [22]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer_tfidf = TfidfVectorizer(max_features=None, ngram_range=(1,1), stop_words='english')
vectorizer_tfidf.fit(X_train)

print(vectorizer_tfidf.vocabulary_)

{'company': 1691, 'overview': 5611, 'digital': 2425, 'assets': 803, 'data': 2167, 'leader': 4588, 'helping': 3784, 'sophisticated': 7272, 'institutional': 4202, 'investors': 4353, 'understand': 8118, 'cryptoasset': 2093, 'markets': 4883, 'unique': 8138, 'sets': 7058, 'insightful': 4180, 'analysis': 623, 'fast': 3198, 'growing': 3674, 'startup': 7429, '5m': 209, 'seed': 6985, 'funding': 3462, 'rolled': 6787, 'software': 7245, 'crypto': 2091, 'funds': 3464, 'scientist': 6922, 'role': 6784, 'looking': 4759, 'join': 4426, 'analytics': 633, 'team': 7742, 'based': 985, 'denver': 2300, 'xe2': 8672, 'x80': 8562, 'x99ll': 8601, 'high': 3797, 'performing': 5750, 'leading': 4591, 'disruption': 2521, 'investment': 4351, 'industry': 4088, 'motivated': 5174, 'passionate': 5689, 'designing': 2344, 'novel': 5369, 'trading': 7945, 'indicators': 4076, 'indices': 4077, 'analyses': 621, 'emerging': 2795, 'field': 3237, 'cryptoanalysis': 2092, 'producing': 6127, 'thought': 7854, 'leadership': 4590, 'help':

In [26]:
train_word_counts_tfidf = vectorizer_tfidf.transform(X_train)
X_train_vectorized_tfidf = pd.DataFrame(train_word_counts_tfidf.toarray(),
                                       columns=vectorizer_tfidf.get_feature_names_out())

print(X_train_vectorized_tfidf.shape)
X_train_vectorized_tfidf.head()

(400, 8723)


[Output truncated: first five rows (index 0 to 4) of a TF-IDF feature table. Columns are vocabulary terms running from `00` to `zywave`, including byte-escape fragments such as `x99s` and `x9cthe`; almost every value is 0.0.]


In [27]:
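# Use transform (not fit_transform) so the test set reuses the vocabulary and IDF weights learned when the vectorizer was fitted
test_word_counts_tfidf = vectorizer_tfidf.transform(X_test)
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() (available since 1.0) replaces it
X_test_vectorized_tfidf = pd.DataFrame(test_word_counts_tfidf.toarray(),
                                       columns=vectorizer_tfidf.get_feature_names_out())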

print(X_test_vectorized_tfidf.shape)
X_test_vectorized_tfidf.head()

(100, 8723)


[Output truncated: first five rows (index 0 to 4) of `X_test_vectorized_tfidf`. Columns are vocabulary terms running from `00` to `zywave`; almost every value is 0.0.]
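
The cell above calls `transform` rather than `fit_transform` on the test set. A minimal sketch of why that matters, using a hypothetical toy corpus in place of `X_train` and `X_test` (the documents and variable names below are assumptions, not part of the dataset):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpora standing in for X_train and X_test.
train_docs = [
    "data scientist builds machine learning models",
    "data analyst writes sql reports",
    "analyst builds dashboards and reports",
]
test_docs = ["scientist writes python and sql"]

vectorizer = TfidfVectorizer()
train_matrix = vectorizer.fit_transform(train_docs)  # learns vocabulary and IDF weights
test_matrix = vectorizer.transform(test_docs)        # reuses them without refitting

# Both matrices have one column per term seen during fitting.
print(train_matrix.shape, test_matrix.shape)

# "python" never appeared in the training documents, so it has no column.
print("python" in vectorizer.vocabulary_)
```

Because the columns line up, a model trained on the first matrix can score the second directly.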


In [29]:
# Fit logistic regression on the TF-IDF features
LR_tfidf = LogisticRegression(solver='lbfgs', random_state=42).fit(X_train_vectorized_tfidf, y_train)
train_predictions_tfidf = LR_tfidf.predict(X_train_vectorized_tfidf)
test_predictions_tfidf = LR_tfidf.predict(X_test_vectorized_tfidf)

# These scores are computed from hard class predictions, not predicted probabilities
print(f'Train ROC_AUC Score: {roc_auc_score(y_train, train_predictions_tfidf)}')
print(f'Test ROC_AUC Score: {roc_auc_score(y_test, test_predictions_tfidf)}')

Train ROC_AUC Score: 0.9775410164065627
Test ROC_AUC Score: 0.9303542673107892
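
The scores above come from passing the output of `predict` to `roc_auc_score`. With hard 0/1 labels the ROC curve has a single operating point, and the result equals balanced accuracy. The sketch below contrasts that with scoring on positive-class probabilities. It uses synthetic data from `make_classification` as a stand-in for the vectorized job listings, so the numbers it prints say nothing about this dataset.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic binary data (an assumption; not the job listings).
X, y = make_classification(n_samples=400, n_features=20, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

model = LogisticRegression(solver="lbfgs", max_iter=1000, random_state=42)
model.fit(X_tr, y_tr)

hard_labels = model.predict(X_te)
positive_scores = model.predict_proba(X_te)[:, 1]  # probability of class 1

# Hard labels: one threshold, so the "AUC" collapses to balanced accuracy.
auc_from_labels = roc_auc_score(y_te, hard_labels)
balanced_acc = balanced_accuracy_score(y_te, hard_labels)

# Probabilities: the metric ranks every test example across all thresholds.
auc_from_scores = roc_auc_score(y_te, positive_scores)

print(f"AUC from labels:        {auc_from_labels:.3f}")
print(f"Balanced accuracy:      {balanced_acc:.3f}")
print(f"AUC from probabilities: {auc_from_scores:.3f}")
```

The same change would apply to any classifier here that exposes `predict_proba`, including the random forest.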


In [30]:
# Fit a random forest on the same TF-IDF features; no random_state is set, so scores vary between runs
RFC_tfidf = RandomForestClassifier(n_estimators=100).fit(X_train_vectorized_tfidf, y_train)
train_predictions_tfidf2 = RFC_tfidf.predict(X_train_vectorized_tfidf)
test_predictions_tfidf2 = RFC_tfidf.predict(X_test_vectorized_tfidf)

# As with logistic regression, these scores use hard class predictions
print(f'Train ROC_AUC Score: {roc_auc_score(y_train, train_predictions_tfidf2)}')
print(f'Test ROC_AUC Score: {roc_auc_score(y_test, test_predictions_tfidf2)}')

Train ROC_AUC Score: 0.9924469787915166
Test ROC_AUC Score: 0.9613526570048309
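
The requirements at the top ask for tuning n-gram ranges and data cleaning options. One possible way to do that is to wrap the vectorizer and classifier in a `Pipeline` and search over vectorizer settings with `GridSearchCV`. The sketch below is only an illustration: the eight documents, their labels, and the parameter grid are hypothetical, and `cv=2` is chosen only because the toy corpus is tiny.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus: 1 = "Data Scientist", 0 = "Data Analyst".
docs = [
    "build machine learning models in python",
    "train deep learning models for prediction",
    "research statistics and machine learning",
    "deploy predictive models with python",
    "write sql queries and build dashboards",
    "prepare excel reports for stakeholders",
    "create dashboards and weekly reports",
    "analyze sales data with sql and excel",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# The vectorizer is refitted inside each fold, so no test text leaks into it.
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(stop_words="english")),
    ("clf", LogisticRegression(solver="lbfgs", max_iter=1000, random_state=42)),
])

# Step name + double underscore + parameter name addresses a pipeline step.
param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "tfidf__max_features": [None, 50],
}

search = GridSearchCV(pipeline, param_grid, scoring="roc_auc", cv=2)
search.fit(docs, labels)

print(search.best_params_)
print(search.best_score_)
```

On the real data, `docs` and `labels` would be the raw training text and targets, and a larger `cv` would give steadier estimates.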


# Stretch Goals

- Try agglomerative clustering with cosine distance, which tends to work better than Euclidean distance in high-dimensional spaces such as TF-IDF vectors. Ward linkage is another option, though it assumes Euclidean distance. Try to build a dendrogram of the most important terms in the dataset.

- Resources for the clustering stretch goal:
  - Agglomerative Clustering with Scipy: <https://joernhees.de/blog/2015/08/26/scipy-hierarchical-clustering-and-dendrogram-tutorial/>
  - Agglomerative Clustering for NLP: <http://brandonrose.org/clustering>
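
As a starting point for the clustering stretch goal, here is a minimal sketch using SciPy's hierarchical clustering with cosine distance. The six documents are hypothetical, and average linkage is an assumption chosen because Ward linkage is defined for Euclidean distance only. The sketch clusters documents; transposing the matrix would cluster terms instead.

```python
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy documents; every one keeps at least one non-stop word,
# so no row is all zeros (cosine distance is undefined for zero vectors).
docs = [
    "machine learning models in python",
    "deep learning and machine learning research",
    "python models for prediction",
    "sql queries and excel reports",
    "dashboards and weekly sql reports",
    "excel dashboards for stakeholders",
]

tfidf = TfidfVectorizer(stop_words="english").fit_transform(docs).toarray()

# Agglomerative clustering: average linkage over pairwise cosine distances.
Z = linkage(tfidf, method="average", metric="cosine")

# Cut the tree so that at most two flat clusters remain.
cluster_ids = fcluster(Z, t=2, criterion="maxclust")
print(cluster_ids)

# no_plot=True returns the dendrogram layout without drawing it.
doc_labels = [f"doc{i}" for i in range(len(docs))]
tree = dendrogram(Z, no_plot=True, labels=doc_labels)
print(tree["ivl"])  # leaf order, left to right
```

Dropping `no_plot=True` in a notebook with matplotlib loaded draws the dendrogram itself.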
 
- Use Latent Dirichlet Allocation (LDA) to perform topic modeling on the dataset:
  - Topic Modeling and LDA in Python: <https://towardsdatascience.com/topic-modeling-and-latent-dirichlet-allocation-in-python-9bf156893c24>
  - Topic Modeling and LDA using Gensim: <https://www.machinelearningplus.com/nlp/topic-modeling-gensim-python/>
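
The linked tutorials use Gensim. As an alternative that stays within the libraries already imported in this notebook, the sketch below uses scikit-learn's `LatentDirichletAllocation` instead. The documents and the choice of two topics are hypothetical, and a corpus this small will not produce meaningful topics; it only shows the shape of the workflow.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy documents standing in for the job descriptions.
docs = [
    "machine learning models in python",
    "deep learning and machine learning research",
    "python models for prediction",
    "sql queries and excel reports",
    "dashboards and weekly sql reports",
    "excel dashboards for stakeholders",
]

# LDA models word counts, so use raw counts rather than TF-IDF weights.
count_vectorizer = CountVectorizer(stop_words="english")
counts = count_vectorizer.fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=42)
doc_topics = lda.fit_transform(counts)  # one topic distribution per document

# Recover terms in column order from the fitted vocabulary mapping.
vocab = sorted(count_vectorizer.vocabulary_, key=count_vectorizer.vocabulary_.get)

for topic_index, weights in enumerate(lda.components_):
    top_terms = [vocab[i] for i in weights.argsort()[::-1][:3]]
    print(topic_index, top_terms)
```

Each row of `doc_topics` sums to 1, so it can be read as the share of each topic in that document.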
