## Step04 - Implementing and Configuring StopWords on Culture and Work Environment Data ## 

Now that we have run CountVectorizer on our documents, the next step is to apply stop words so that only the terms most relevant to our project remain. For example, words such as "and", "apply", and "applicants" add nothing to our analysis.



In [1]:
import pandas as pd

import nltk
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import RegexpTokenizer

# On newer scikit-learn releases the stop_words module is no longer public;
# ENGLISH_STOP_WORDS can be imported from sklearn.feature_extraction.text instead.
from sklearn.feature_extraction import stop_words
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

lemmatizer = WordNetLemmatizer()

In [2]:

custom_stopwords = [
    '000', '01', '06', '08', '10254', '12', '15',
    '19', '2018', '22', '25', '28', '45', '500',
    'cox', 'norfolk', 'apply', 'com', 'www', 'applications', 'application',
    'applicants', 'southern', 'https', 'ia', 'var', 'indeedapply', 'env',
    'atlanta', 'opportunity', 'iip', 'gender', 'location', 'new', 'employer',
    'midtown', 'manheim', 'ml', 'including', 'llc', 'truck', 'automotive', 'nationality',
    'nation', 'iot', 'kelley', 'hopea', 'date', 'incadea', 'honeywell', 'j00067915', '00', '10',
    '100', '16', '16614', '18', '18230', '1901', '20', '2006', '2013', '24', '27', '401', '40b',
    '420', '450', '50', '60', 'kentucky',
]
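
As a rough illustration of what a combined stop word list does, the sketch below vectorizes a tiny corpus twice, once without stop words and once with scikit-learn's built-in English list plus a few of the custom entries. The three sentences in `docs` are invented for the example and are not taken from the project data.

```python
from sklearn.feature_extraction.text import CountVectorizer, ENGLISH_STOP_WORDS

# Hypothetical mini-corpus, only for illustration.
docs = [
    "Apply now and join our data team",
    "Applicants should apply online",
    "The team values data and collaboration",
]

# A handful of the custom entries, added on top of the built-in English list.
extra_stopwords = ["apply", "applicants", "com", "www"]
all_stopwords = list(ENGLISH_STOP_WORDS) + extra_stopwords

plain = CountVectorizer().fit(docs)
filtered = CountVectorizer(stop_words=all_stopwords).fit(docs)

# vocabulary_ maps each kept term to its column index.
plain_vocab = sorted(plain.vocabulary_)
filtered_vocab = sorted(filtered.vocabulary_)

print("without stop words:", plain_vocab)
print("with stop words:   ", filtered_vocab)
```

Both generic words such as "and" and project-specific noise such as "apply" drop out of the second vocabulary, while content words such as "data" stay.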

**The min_df parameter of CountVectorizer**

Note the use of `min_df`. It sets a minimum document frequency: a term is dropped from the vocabulary if it appears in fewer than `n` documents.

In this case `min_df=10`, so we keep only the terms that occur in at least 10 paragraphs.
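
To make the behaviour of `min_df` concrete, here is a minimal sketch on an invented four-document corpus. It shows that the threshold counts documents, not total occurrences: "pipeline" appears three times but only in one document, so it is dropped at `min_df=2`.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical corpus: 'pipeline' is frequent overall but confined to one document.
docs = [
    "pipeline pipeline pipeline data",
    "team data",
    "team culture",
    "team salary",
]

# Keep terms that appear in at least 2 documents.
vocab_min2 = sorted(CountVectorizer(min_df=2).fit(docs).vocabulary_)
# Keep terms that appear in at least 3 documents.
vocab_min3 = sorted(CountVectorizer(min_df=3).fit(docs).vocabulary_)

print(vocab_min2)  # ['data', 'team']
print(vocab_min3)  # ['team']
```

An integer `min_df` is an absolute document count; a float between 0 and 1 is read as a proportion of documents instead.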

In [3]:


# Start from scikit-learn's built-in English stop words and add our custom list.
new_stop = list(stop_words.ENGLISH_STOP_WORDS)
new_stop.extend(custom_stopwords)

# min_df=10 keeps only terms that appear in at least 10 documents.
cvec = CountVectorizer(stop_words=new_stop, min_df=10)


Para = pd.read_csv('Paragraphs')

cvec.fit(Para['0'])
new_corpus = cvec.transform(Para['0'])
new_corpus

<148x9 sparse matrix of type '<class 'numpy.int64'>'
	with 134 stored elements in Compressed Sparse Row format>

In [4]:
new_corpus.todense()

matrix([[0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 1, 0, ..., 0, 0, 0],
        ...,
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0]])

In [5]:
Para['0'][6]

'Job Type: Contract'

`cvec.fit()` expects an iterable of text documents, such as a pandas Series of strings:

In [6]:
cvec.fit(Para['0'])
new_corpus = cvec.transform(Para['0'])
new_corpus

<148x9 sparse matrix of type '<class 'numpy.int64'>'
	with 134 stored elements in Compressed Sparse Row format>
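
The input requirement can be checked on a small scale. The sketch below uses an invented three-row Series (only the first string comes from the data above), fits and transforms in one call with `fit_transform`, and shows that a bare string is rejected.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical Series of raw text documents.
texts = pd.Series(["Job Type: Contract", "Data scientist job", "Team work"])

cvec_demo = CountVectorizer()
# fit_transform is equivalent to calling fit and then transform on the same data.
counts = cvec_demo.fit_transform(texts)
print(counts.shape)  # (number of documents, number of distinct terms)

# A single string is not a collection of documents, so scikit-learn raises ValueError.
try:
    cvec_demo.fit("Job Type: Contract")
    raised = False
except ValueError as err:
    raised = True
    print("rejected:", err)
```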

In [7]:
new_corpus.todense()

matrix([[0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 1, 0, ..., 0, 0, 0],
        ...,
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0]])

In [8]:
# get_feature_names() was replaced by get_feature_names_out() in newer scikit-learn releases.
df2 = pd.DataFrame(new_corpus.todense(),
                   columns=cvec.get_feature_names())

df2.head()

df2.sort_values('data', ascending=False).head(10)

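| | analytics | data | experience | job | salary | scientist | skills | team | work |
|---|---|---|---|---|---|---|---|---|---|
| 135 | 2 | 11 | 0 | 0 | 0 | 2 | 0 | 3 | 2 |
| 109 | 5 | 6 | 0 | 0 | 0 | 0 | 1 | 2 | 0 |
| 139 | 0 | 5 | 0 | 1 | 0 | 1 | 0 | 0 | 0 |
| 107 | 2 | 5 | 4 | 0 | 0 | 2 | 1 | 0 | 1 |
| 113 | 1 | 5 | 5 | 0 | 0 | 0 | 1 | 0 | 1 |
| 5 | 1 | 5 | 3 | 0 | 0 | 0 | 1 | 0 | 1 |
| 93 | 0 | 4 | 0 | 0 | 0 | 2 | 1 | 1 | 0 |
| 13 | 0 | 4 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 40 | 1 | 4 | 0 | 0 | 0 | 1 | 0 | 1 | 0 |
| 94 | 0 | 4 | 0 | 0 | 0 | 2 | 0 | 1 | 1 |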


In [9]:
df2.sum().sort_values(ascending=False)

data          74
team          24
job           24
analytics     22
salary        21
experience    21
scientist     18
work          14
skills        10
dtype: int64
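
The same per-term totals can also be computed straight from the sparse matrix, which avoids building a dense DataFrame first. This is a sketch on an invented three-document corpus, not the project data; the term order is recovered from `vocabulary_` so it does not depend on the scikit-learn version.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical corpus for illustration.
docs = ["data team data", "team work", "data work data"]

cvec_demo = CountVectorizer()
counts = cvec_demo.fit_transform(docs)  # sparse document-term matrix

# Order the terms by their column index in the matrix.
terms = sorted(cvec_demo.vocabulary_, key=cvec_demo.vocabulary_.get)

# Summing over axis 0 gives one total per term (column).
totals = pd.Series(np.asarray(counts.sum(axis=0)).ravel(), index=terms)
totals = totals.sort_values(ascending=False)
print(totals)
```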

In [10]:
# Inspect the built-in English stop words used in the cleanup above

from sklearn.feature_extraction import stop_words

print(stop_words.ENGLISH_STOP_WORDS)

frozenset({'as', 'should', 'etc', 'against', 'of', 'namely', 'than', 'when', 'done', 'may', 'no', 'will', 'since', 'this', 'become', 'cannot', 'cant', 'be', 'both', 'then', 'yet', 'below', 'our', 'somewhere', 'yourself', 'are', 'she', 'someone', 'many', 'throughout', 'next', 'everywhere', 'he', 'after', 'do', 'towards', 'even', 'bill', 'all', 'else', 'fifty', 'her', 'move', 'among', 'beyond', 'now', 'eight', 'can', 'fifteen', 'most', 'only', 'six', 'please', 'across', 'serious', 'toward', 'we', 'yours', 'front', 'other', 'keep', 'had', 'con', 'became', 'over', 'mill', 'own', 'whereupon', 'onto', 'go', 'perhaps', 'thereupon', 'thin', 'co', 'very', 'nor', 'that', 'further', 'am', 'noone', 'thick', 'whereas', 'anyway', 'by', 'his', 'well', 'around', 'must', 'those', 'each', 'thereby', 'been', 'couldnt', 'afterwards', 'hereafter', 'ourselves', 'wherein', 'was', 'sincere', 'although', 'is', 'thru', 'until', 'nobody', 'enough', 'above', 'one', 'behind', 'formerly', 'last', 'still', 'somethin

**Saving cleaned up data to a CSV**

We have cleaned up our "Paragraphs" data so that only the words most relevant to the work environment and cultural requirements of the job remain.

Here we save it to a CSV so that it is available for later use.

We then read the CSV back in and do some additional cleanup, because `to_csv` writes the DataFrame index by default and `read_csv` loads it back as an extra `Unnamed: 0` column.

In [11]:
#save cleaned up dataframe

df2.to_csv('Cultural_Word_Rank')

In [12]:
df2 = pd.read_csv('Cultural_Word_Rank')

In [13]:
df2.sum().sort_values(ascending=False)

Unnamed: 0    10878
data             74
team             24
job              24
analytics        22
salary           21
experience       21
scientist        18
work             14
skills           10
dtype: int64
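
The `Unnamed: 0` entry in the totals is the saved row index, not a word count. A minimal sketch of the round trip, using an invented two-column frame and an in-memory buffer instead of a file, shows where the column comes from and two common ways to avoid it.

```python
import io
import pandas as pd

# Hypothetical count table standing in for df2.
word_counts = pd.DataFrame({"data": [0, 1, 5], "team": [0, 0, 1]})

def round_trip(frame, write_kwargs=None, read_kwargs=None):
    """Write a frame to CSV text in memory and read it back."""
    buf = io.StringIO()
    frame.to_csv(buf, **(write_kwargs or {}))
    buf.seek(0)
    return pd.read_csv(buf, **(read_kwargs or {}))

# Default behaviour: the index is written and comes back as 'Unnamed: 0'.
naive = round_trip(word_counts)
print(list(naive.columns))

# Option 1: do not write the index at all.
no_index = round_trip(word_counts, write_kwargs={"index": False})

# Option 2: write it, but tell read_csv that the first column is the index.
restored = round_trip(word_counts, read_kwargs={"index_col": 0})
print(list(no_index.columns), list(restored.columns))
```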

In [14]:
df2

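| | Unnamed: 0 | analytics | data | experience | job | salary | scientist | skills | team | work |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 2 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 3 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 4 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 5 | 5 | 1 | 5 | 3 | 0 | 0 | 0 | 1 | 0 | 1 |
| 6 | 6 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 7 | 7 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 8 | 8 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 9 | 9 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |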


In [15]:
df2.iloc[:,0:10].sum()

Unnamed: 0    10878
analytics        22
data             74
experience       21
job              24
salary           21
scientist        18
skills           10
team             24
work             14
dtype: int64

In [16]:
CountVectorizer(min_df=1)

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)

In [17]:
df2.sum().sort_values(ascending=True)

skills           10
work             14
scientist        18
experience       21
salary           21
analytics        22
job              24
team             24
data             74
Unnamed: 0    10878
dtype: int64

In [18]:
list(df2.columns.values)

['Unnamed: 0',
 'analytics',
 'data',
 'experience',
 'job',
 'salary',
 'scientist',
 'skills',
 'team',
 'work']

In [19]:
cultural_words = df2

In [20]:
cultural_words.columns 

Index(['Unnamed: 0', 'analytics', 'data', 'experience', 'job', 'salary',
       'scientist', 'skills', 'team', 'work'],
      dtype='object')

In [21]:
df2.sum().sort_values()[df2.sum().sort_values() > 10].index

Index(['work', 'scientist', 'experience', 'salary', 'analytics', 'job', 'team',
       'data', 'Unnamed: 0'],
      dtype='object')
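
Because `Unnamed: 0` holds row numbers, it passes any count threshold and ends up in the filtered index. One possible cleanup, sketched here on a small invented frame, is to drop that column before summing and filtering.

```python
import pandas as pd

# Hypothetical frame with the same shape of problem: a reloaded index column plus term counts.
reloaded = pd.DataFrame({
    "Unnamed: 0": [0, 1, 2],
    "data": [5, 6, 0],
    "team": [1, 0, 1],
    "skills": [0, 1, 0],
})

# Drop the leftover index column so only real terms are summed.
term_counts = reloaded.drop(columns=["Unnamed: 0"]).sum()

# Keep terms whose total count exceeds the threshold (1 here, 10 in the cell above).
frequent = term_counts[term_counts > 1].sort_values(ascending=False)
print(list(frequent.index))  # ['data', 'team']
```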