# Feature Engineering

The next step is to create features from the raw text so we can train the machine learning models. The steps followed are:

1. **Text Cleaning and Preparation**: cleaning of special characters, downcasing, punctuation signs. possessive pronouns and stop words removal and lemmatization. 
2. **Label coding**: creation of a dictionary to map each category to a code.
3. **Train-test split**: to test the models on unseen data.
4. **Text representation**: use of TF-IDF scores to represent text.

In [248]:
import pickle
import pandas as pd
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import chi2
import numpy as np

In [249]:
import import_func as imp
import classifier_help as cls

First of all we'll load the dataset:

In [250]:
path_df = "pickles/hand_coded_ALL.pickle"
# full_filename = "../data/by_article_fulltext_112919-2.jl" # OLD VERSION
full_filename = "../data/by_article_fulltext_020920.jl"


with open(path_df, 'rb') as data:
    coded_df = pickle.load(data)

df_full = imp.init_df(full_filename, "full", "df")

In [251]:
coded_df.head()

Unnamed: 0,id,first,second,wc,first_f,second_f,input
0,730,54,0,850,0.063529,0.0,1
1,4200,10,0,790,0.012658,0.0,3
2,4453,76,1,1088,0.069853,0.000919,1
3,286,3,64,1229,0.002441,0.052075,2
4,1034,6,29,1322,0.004539,0.021936,3


Need to cut the df down to only the ones that have been categorized and add the categories project

In [252]:
df = coded_df.merge(right=df_full, how="left", left_on="id", right_index=True)

print(len(df))
df.head()

197


Unnamed: 0,id_x,first,second,wc,first_f,second_f,input,id_y,headline,tags,...,date,time,text,bio,date_seq,month_seq,year,n_posts_author,column1,column2
0,730,54,0,850,0.063529,0.0,1,6055,"diary of a british scientist, part 2: brushing...","[job market, europe]",...,1999-06-04,8:00 am,by so after deciding that i wanted...,[],977,42,1999,7,no,yes
1,4200,10,0,790,0.012658,0.0,3,3389,alyson reed takes the helm at npa,"[issues and perspectives, advice, postdoc, ame...",...,2003-09-19,4:00 am,"by n 4 september, a new force joined the...","[xenia morin, ph.d., is a keck postdoctoral te...",2545,93,2003,1,no,no
2,4453,76,1,1088,0.069853,0.000919,1,3612,unveiling the blindness,"[workplace diversity, myscinet, undergraduate,...",...,2004-02-20,0:00 am,"by n a daily basis, i strive to be the ...",[],2699,98,2004,1,no,no
3,286,3,64,1229,0.002441,0.052075,2,87,talk yourself right into a job,"[tooling up, column, non-disciplinary]",...,2017-04-12,2:30 pm,by i’m sure you’ve heard the expression use...,[],7499,256,2017,247,no,yes
4,1034,6,29,1322,0.004539,0.021936,3,4704,loan-repayment for biomedical researchers,"[advice, americas]",...,2001-12-14,0:00 am,by whatever happe...,"[due to the high volume of questions received,...",1901,72,2001,84,no,yes


And visualize one sample news content:

In [253]:
# since this notebook uses the column heading content and I don't feel like changing every single one yet
df["Content"] = df["text"]

print(df.loc[128]['input'])
print(df.loc[128]['Content'])


3
  by          dward ruthazer (pictured left) is a mapmaker of sorts, but the maps he makes are not of places in the world. ruthazer studies brain development, charting intricate neurocircuitries in the hope of advancing treatments for injuries of the central nervous system and therapies for developmental disorders. like the neural connections he maps, ruthazer's path has twisted and turned from china to japan and across north america, as he has investigated the mechanisms that shape the wiring of the human brain. for this young neuroscientist, working overseas in different cultures and acquiring new personal and professional skills provided a sense of independence and confidence as a researcher, qualities that he tries to instill in the students and trainees working in his laboratory.       less than a year after finishing his second postdoc at cold spring harbor laboratory in new york, ruthazer is settling in as assistant professor at mcgill university's montreal neurological instit

2## 1. Text cleaning and preparation

### 1.1. Special character cleaning

We can see the following special characters:

* ``\r``
* ``\n``
* ``\`` before possessive pronouns (`government's = government\'s`)
* ``\`` before possessive pronouns 2 (`Yukos'` = `Yukos\'`)
* ``"`` when quoting text

In [254]:
# \r and \n
df['Content_Parsed_1'] = df['Content'].str.replace("\r", " ")
df['Content_Parsed_1'] = df['Content_Parsed_1'].str.replace("\n", " ")
df['Content_Parsed_1'] = df['Content_Parsed_1'].str.replace("    ", " ")
df['Content_Parsed_1'] = df['Content_Parsed_1'].str.replace("\'s'", "")
df['Content_Parsed_1'] = df['Content_Parsed_1'].str.replace("\'", "")
# because I still need quotation marks
print(df.loc[1]["Content_Parsed_1"].count("\""))

9


### 1.2. Upcase/downcase

I already did this when importing the data

### 1.3. Punctuation signs

Punctuation signs won't have any predicting power, so we'll just get rid of them.

In [255]:
## I use my own function to remove quotations before removing these punctuation marks

df["Content_Parsed_2"] = cls.no_punctuation(df["Content_Parsed_1"], quotes=True)
df["Content_Parsed_3"] = [cls.replace_quotes(text) for text in df["Content_Parsed_2"]]

In [256]:
print(df.loc[128]['input'])
print(df.loc[128]['Content_Parsed_3'])

3
  by    dward ruthazer (pictured left) is a mapmaker of sorts, but the maps he makes are not of places in the world. ruthazer studies brain development, charting intricate neurocircuitries in the hope of advancing treatments for injuries of the central nervous system and therapies for developmental disorders. like the neural connections he maps, ruthazers path has twisted and turned from china to japan and across north america, as he has investigated the mechanisms that shape the wiring of the human brain. for this young neuroscientist, working overseas in different cultures and acquiring new personal and professional skills provided a sense of independence and confidence as a researcher, qualities that he tries to instill in the students and trainees working in his laboratory.    less than a year after finishing his second postdoc at cold spring harbor laboratory in new york, ruthazer is settling in as assistant professor at mcgill universitys montreal neurological institute (mni) i

In [257]:
punctuation_signs = list("?:!.,;\"()")

for punct_sign in punctuation_signs:
    df['Content_Parsed_3'] = df['Content_Parsed_3'].str.replace(punct_sign, '')

for punct_sign_sp in list("-/"):
    df['Content_Parsed_3'] = df['Content_Parsed_3'].replace(punct_sign_sp, ' ')

In [258]:
# print(df.iloc[135]['input'])
# print(df.iloc[135]['Content_Parsed_3'])

By doing this we are messing up with some numbers, but it's no problem since we aren't expecting any predicting power from them.

### 1.4. Possessive pronouns

We'll also remove possessive pronoun terminations:

In [259]:
df['Content_Parsed_4'] = df['Content_Parsed_3'].str.replace("'s", "")

### 1.5. Stemming and Lemmatization

Since stemming can produce output words that don't exist, we'll only use a lemmatization process at this moment. Lemmatization takes into consideration the morphological analysis of the words and returns words that do exist, so it will be more useful for us.

In [260]:
# Downloading punkt and wordnet from NLTK
nltk.download('punkt')
print("------------------------------------------------------------")
nltk.download('wordnet')

------------------------------------------------------------
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Clara\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Clara\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [261]:
# Saving the lemmatizer into an object
wordnet_lemmatizer = WordNetLemmatizer()

In order to lemmatize, we have to iterate through every word:

In [262]:
nrows = len(df)
lemmatized_text_list = []

for row in range(0, nrows):
    
    # Create an empty list containing lemmatized words
    lemmatized_list = []
    
    # Save the text and its words into an object
    text = df.iloc[row]['Content_Parsed_4']
    text_words = text.split(" ")

    # Iterate through every word to lemmatize
    for word in text_words:
        lemmatized_list.append(wordnet_lemmatizer.lemmatize(word, pos="v"))
        
    # Join the list
    lemmatized_text = " ".join(lemmatized_list)
    
    # Append to the list containing the texts
    lemmatized_text_list.append(lemmatized_text)

In [263]:
df['Content_Parsed_5'] = lemmatized_text_list

In [264]:
print(df.iloc[148]['input'])
print(df.iloc[148]['Content_Parsed_5'])

2
  by  in a recent essay publish at   career   a well-published scientist and adjunct lecturer at the university of florida in gainesville explain what she have learn in apply for   since earn her doctorate in 2004 kaplan draw the conclusion that emphasize her ability to garner federal research grant be likely to make her more successful in her search for a faculty position right or wrong kaplan’s essay—and her willingness to share her experience—is welcome and useful such first-person reflections be likely to be valuable to job seekers altogether the number of institutions of higher learn in the unite state approach 4000 include more than 3000 nonprofit institutions thats a much bigger pool of potential job than those offer at 200 or so research universities and she may well be right a strong publication record be certainly a key to get hire and anything you can do to convince an institution that you can attract outside fund be likely to increase your odds of win an offer for a tenur

Although lemmatization doesn't work perfectly in all cases (as can be seen in the example below), it can be useful.

### 1.6. Stop words

In [265]:
# Downloading the stop words list
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Clara\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [266]:
# Loading the stop words in english
stop_words = list(stopwords.words('english'))
print(type(stop_words))

<class 'list'>


In [267]:
stop_words[0:10]

pronouns = ["i","im", "ive", "id","my", "me", "myself","we","wed","weve","us","our", "ours","ourselves","you","youre", "youve","youd","your", "yourself", "yourselves","he","hes","hed","hell","him","his", "himself","she","shes","shell","shed","her","hers","herself","they","them","their","theirs","themself","themselves"]
stop_words_wo_pronouns = [word for word in stop_words if word not in pronouns]
stop_words_wo_pronouns[0:10]

["you're",
 "you've",
 "you'll",
 "you'd",
 'yours',
 "she's",
 'it',
 "it's",
 'its',
 'itself']

To remove the stop words, we'll handle a regular expression only detecting whole words, as seen in the following example:

In [268]:
# example = "me eating a meal"
# word = "me"

# # The regular expression is:
# regex = r"\b" + word + r"\b"  # we need to build it like that to work properly

# re.sub(regex, "StopWord", example)

We can now loop through all the stop words:

In [269]:
df['Content_Parsed_6'] = df['Content_Parsed_5']

for stop_word in stop_words_wo_pronouns:

    regex_stopword = r"\b" + stop_word + r"\b"
    df['Content_Parsed_6'] = df['Content_Parsed_6'].str.replace(regex_stopword, '')

We have some dobule/triple spaces between words because of the replacements. However, it's not a problem because we'll tokenize by the spaces later.

As an example, we'll show an original news article and its modifications throughout the process:

In [270]:
df.loc[5]['Content']

'  by  by now, any academic scientist working in national science foundation (nsf)-funded fields is well aware of the agency\'s "broader impacts" (bi) review criterion. for years, applicants for nsf grants have been required to supplement their discussion of a project\'s "intellectual merit" (im) with a separate discussion of bi in the project summary. last week, nsf\'s division of chemical, bioengineering, environmental, and transport systems (cbet), part of its engineering directorate, issued   reminding applicants that starting in 2014 proposals must also include separate bi and im sections in the project description narrative and in the section describing the results of prior nsf support. "if any of these requirements (or any other requirement from nsf 13-1 document) are not met,\xa0the proposal will not pass the nsf compliance check and will be returned without review," writes the cbet staff.\xa0"we would like to avoid such unfortunate instances for our division." nsf\'s grant pro

1. Special character cleaning

In [271]:
df.loc[5]['Content_Parsed_1']

'  by  by now, any academic scientist working in national science foundation (nsf)-funded fields is well aware of the agencys "broader impacts" (bi) review criterion. for years, applicants for nsf grants have been required to supplement their discussion of a projects "intellectual merit" (im) with a separate discussion of bi in the project summary. last week, nsfs division of chemical, bioengineering, environmental, and transport systems (cbet), part of its engineering directorate, issued   reminding applicants that starting in 2014 proposals must also include separate bi and im sections in the project description narrative and in the section describing the results of prior nsf support. "if any of these requirements (or any other requirement from nsf 13-1 document) are not met,\xa0the proposal will not pass the nsf compliance check and will be returned without review," writes the cbet staff.\xa0"we would like to avoid such unfortunate instances for our division." nsfs grant proposal gu

2. Upcase/downcase

In [272]:
df.loc[5]['Content_Parsed_2']

'  by  by now, any academic scientist working in national science foundation (nsf)-funded fields is well aware of the agencys "broader impacts" (bi) review criterion. for years, applicants for nsf grants have been required to supplement their discussion of a projects "intellectual merit" (im) with a separate discussion of bi in the project summary. last week, nsfs division of chemical, bioengineering, environmental, and transport systems (cbet), part of its engineering directorate, issued   reminding applicants that starting in 2014 proposals must also include separate bi and im sections in the project description narrative and in the section describing the results of prior nsf support. "if any of these requirements (or any other requirement from nsf 13-1 document) are not met,\xa0the proposal will not pass the nsf compliance check and will be returned without review," writes the cbet staff.\xa0"we would like to avoid such unfortunate instances for our division." nsfs grant proposal gu

3. Punctuation signs

In [273]:
df.loc[5]['Content_Parsed_3']

'  by  by now any academic scientist working in national science foundation nsf-funded fields is well aware of the agencys  QUOTATION_REPLACEMENT  bi review criterion for years applicants for nsf grants have been required to supplement their discussion of a projects  QUOTATION_REPLACEMENT  im with a separate discussion of bi in the project summary last week nsfs division of chemical bioengineering environmental and transport systems cbet part of its engineering directorate issued   reminding applicants that starting in 2014 proposals must also include separate bi and im sections in the project description narrative and in the section describing the results of prior nsf support  QUOTATION_REPLACEMENT  writes the cbet staff\xa0 QUOTATION_REPLACEMENT  nsfs grant proposal guide is   instructions for preparing the proposal are in   here is the language on the bi criterion from the grant proposal guide  QUOTATION_REPLACEMENT  may be accomplished through the research itself through the activi

4. Possessive pronouns

In [274]:
df.loc[5]['Content_Parsed_4']

'  by  by now any academic scientist working in national science foundation nsf-funded fields is well aware of the agencys  QUOTATION_REPLACEMENT  bi review criterion for years applicants for nsf grants have been required to supplement their discussion of a projects  QUOTATION_REPLACEMENT  im with a separate discussion of bi in the project summary last week nsfs division of chemical bioengineering environmental and transport systems cbet part of its engineering directorate issued   reminding applicants that starting in 2014 proposals must also include separate bi and im sections in the project description narrative and in the section describing the results of prior nsf support  QUOTATION_REPLACEMENT  writes the cbet staff\xa0 QUOTATION_REPLACEMENT  nsfs grant proposal guide is   instructions for preparing the proposal are in   here is the language on the bi criterion from the grant proposal guide  QUOTATION_REPLACEMENT  may be accomplished through the research itself through the activi

5. Stemming and Lemmatization

In [275]:
df.loc[5]['Content_Parsed_5']

'  by  by now any academic scientist work in national science foundation nsf-funded field be well aware of the agencys  QUOTATION_REPLACEMENT  bi review criterion for years applicants for nsf grant have be require to supplement their discussion of a project  QUOTATION_REPLACEMENT  im with a separate discussion of bi in the project summary last week nsfs division of chemical bioengineering environmental and transport systems cbet part of its engineer directorate issue   remind applicants that start in 2014 proposals must also include separate bi and im section in the project description narrative and in the section describe the result of prior nsf support  QUOTATION_REPLACEMENT  write the cbet staff\xa0 QUOTATION_REPLACEMENT  nsfs grant proposal guide be   instructions for prepare the proposal be in   here be the language on the bi criterion from the grant proposal guide  QUOTATION_REPLACEMENT  may be accomplish through the research itself through the activities that be directly relate 

6. Stop words

In [276]:
df.loc[0]['Content_Parsed_6']

'         decide  i want  move   lab  science communication see my   i   major route-planning   i usually travel fairly haphazardly  perhaps  change  career require less    QUOTATION_REPLACEMENT  attitude   first thing    completely revamp  date academic cv  -important list  publications  high-impact scientific journals suddenly seem less important  me  felt great  abandon  stifle format   academic research cv  attempt  reinvent myself within   punchy package    first hurdle   remove my list  publications  --high-impact journals  my ever-growing list  useful laboratory techniques i  face   rather sad realization  i  leave   little   way  skills  experience  document nevertheless  still felt good  dust   cobwebs    communication skills  i develop  my time  far   scientist  general scientists tend  acquire  develop communication skills     they need   my experience  appear   little room  research  devote  either personal  staff development    formal train environment i  already mention  

Finally, we can delete the intermediate columns:

In [277]:
df.head(1)

Unnamed: 0,id_x,first,second,wc,first_f,second_f,input,id_y,headline,tags,...,n_posts_author,column1,column2,Content,Content_Parsed_1,Content_Parsed_2,Content_Parsed_3,Content_Parsed_4,Content_Parsed_5,Content_Parsed_6
0,730,54,0,850,0.063529,0.0,1,6055,"diary of a british scientist, part 2: brushing...","[job market, europe]",...,7,no,yes,by so after deciding that i wanted...,by so after deciding that i wanted to mo...,by so after deciding that i wanted to mo...,by so after deciding that i wanted to mo...,by so after deciding that i wanted to mo...,by so after decide that i want to move f...,decide i want move lab science c...


In [278]:
list_columns = ["id_x", "input", "headline", "n_posts_author","date_seq","month_seq","year","column1","column2","text", "Content_Parsed_6"]
df = df[list_columns]

In [279]:
df = df.rename(columns={'Content_Parsed_6': 'Content_Parsed', 'id_x':'id','input':'Category_Code'})

df.head()

Unnamed: 0,id,Category_Code,headline,n_posts_author,date_seq,month_seq,year,column1,column2,text,Content_Parsed
0,730,1,"diary of a british scientist, part 2: brushing...",7,977,42,1999,no,yes,by so after deciding that i wanted...,decide i want move lab science c...
1,4200,3,alyson reed takes the helm at npa,1,2545,93,2003,no,no,"by n 4 september, a new force joined the...",n 4 september new force join struggle ...
2,4453,1,unveiling the blindness,1,2699,98,2004,no,no,"by n a daily basis, i strive to be the ...",n daily basis i strive laziest mexica...
3,286,2,talk yourself right into a job,247,7499,256,2017,no,yes,by i’m sure you’ve heard the expression use...,i’ sure you’ hear expression use describ...
4,1034,3,loan-repayment for biomedical researchers,84,1901,72,2001,no,yes,by whatever happe...,whatever happen fund proposal cong...


**IMPORTANT:**

We need to remember that our model will gather the latest news articles from different newspapers every time we want. For that reason, we not only need to take into account the peculiarities of the training set articles, but also possible ones that are present in the gathered news articles.

For this reason, possible peculiarities have been studied in the *05. News Scraping* folder.

## 2. Label coding

We'll create a dictionary with the label codification:

In [298]:
category_codes = {
    'first':1,
    'second':2,
    'third':3
}

df['Category'] = df['Category_Code']
df = df.replace({'Category':category_codes})

TypeError: Cannot compare types 'ndarray(dtype=int32)' and 'str'

In [None]:
# # Category mapping
# df['Category_Code'] = df['Category']
# df = df.replace({'Category_Code':category_codes})

In [297]:
df = df.rename(columns={'category':'Category_Code'})

df.head()

Unnamed: 0,id,Category_Code,headline,n_posts_author,date_seq,month_seq,year,column1,column2,text,Content_Parsed,Category
0,730,1,"diary of a british scientist, part 2: brushing...",7,977,42,1999,no,yes,by so after deciding that i wanted...,decide i want move lab science c...,1
1,4200,3,alyson reed takes the helm at npa,1,2545,93,2003,no,no,"by n 4 september, a new force joined the...",n 4 september new force join struggle ...,3
2,4453,1,unveiling the blindness,1,2699,98,2004,no,no,"by n a daily basis, i strive to be the ...",n daily basis i strive laziest mexica...,1
3,286,2,talk yourself right into a job,247,7499,256,2017,no,yes,by i’m sure you’ve heard the expression use...,i’ sure you’ hear expression use describ...,2
4,1034,3,loan-repayment for biomedical researchers,84,1901,72,2001,no,yes,by whatever happe...,whatever happen fund proposal cong...,3


## 3. Train - test split

We'll set apart a test set to prove the quality of our models. We'll do Cross Validation in the train set in order to tune the hyperparameters and then test performance on the unseen data of the test set.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(df['Content_Parsed'], 
                                                    df['Category_Code'], 
                                                    test_size=0.15, 
                                                    random_state=8)

In [None]:
print(len(X_train), len(y_train), len(df))

Since we don't have much observations (only 2.225), we'll choose a test set size of 15% of the full dataset.

## 4. Text representation

We have various options:

* Count Vectors as features
* TF-IDF Vectors as features
* Word Embeddings as features
* Text / NLP based features
* Topic Models as features

We'll use **TF-IDF Vectors** as features.

We have to define the different parameters:

* `ngram_range`: We want to consider both unigrams and bigrams.
* `max_df`: When building the vocabulary ignore terms that have a document
    frequency strictly higher than the given threshold
* `min_df`: When building the vocabulary ignore terms that have a document
    frequency strictly lower than the given threshold.
* `max_features`: If not None, build a vocabulary that only consider the top
    max_features ordered by term frequency across the corpus.

See `TfidfVectorizer?` for further detail.

It needs to be mentioned that we are implicitly scaling our data when representing it as TF-IDF features with the argument `norm`.

In [None]:
# Parameter election
ngram_range = (1,2)
min_df = 10
max_df = 1.
max_features = 300

We have chosen these values as a first approximation. Since the models that we develop later have a very good predictive power, we'll stick to these values. But it has to be mentioned that different combinations could be tried in order to improve even more the accuracy of the models.

In [None]:
tfidf = TfidfVectorizer(encoding='utf-8',
                        ngram_range=ngram_range,
                        stop_words=None,
                        lowercase=False,
                        max_df=max_df,
                        min_df=min_df,
                        max_features=max_features,
                        norm='l2',
                        sublinear_tf=True)
                        
features_train = tfidf.fit_transform(X_train).toarray()
labels_train = y_train
print(features_train.shape)

features_test = tfidf.transform(X_test).toarray()
labels_test = y_test
print(features_test.shape)

In [None]:
print(len(category_codes))

category_codes.items()

Please note that we have fitted and then transformed the training set, but we have **only transformed** the **test set**.

We can use the Chi squared test in order to see what unigrams and bigrams are most correlated with each category:

In [None]:
from sklearn.feature_selection import chi2
import numpy as np

for Product, category_id in sorted(category_codes.items()):
    features_chi2 = chi2(features_train, labels_train == category_id)
    indices = np.argsort(features_chi2[0])
    feature_names = np.array(tfidf.get_feature_names())[indices]
    unigrams = [v for v in feature_names if len(v.split(' ')) == 1]
    bigrams = [v for v in feature_names if len(v.split(' ')) == 2]
    print("# '{}' category:".format(Product))
    print("  . Most correlated unigrams:\n. {}".format('\n. '.join(unigrams[-5:])))
    print("  . Most correlated bigrams:\n. {}".format('\n. '.join(bigrams[-2:])))
    print("")


As we can see, the unigrams correspond well to their category. However, bigrams do not. If we get the bigrams in our features:

In [None]:
bigrams

We can see there are only six. This means the unigrams have more correlation with the category than the bigrams, and since we're restricting the number of features to the most representative 300, only a few bigrams are being considered.

Let's save the files we'll need in the next steps:

In [None]:

# X_train
with open('Pickles/X_train.pickle', 'wb') as output:
    pickle.dump(X_train, output)
    
# X_test    
with open('Pickles/X_test.pickle', 'wb') as output:
    pickle.dump(X_test, output)
    
# y_train
with open('Pickles/y_train.pickle', 'wb') as output:
    pickle.dump(y_train, output)
    
# y_test
with open('Pickles/y_test.pickle', 'wb') as output:
    pickle.dump(y_test, output)
    
# df
with open('Pickles/df.pickle', 'wb') as output:
    pickle.dump(df, output)
    
# features_train
with open('Pickles/features_train.pickle', 'wb') as output:
    pickle.dump(features_train, output)

# labels_train
with open('Pickles/labels_train.pickle', 'wb') as output:
    pickle.dump(labels_train, output)

# features_test
with open('Pickles/features_test.pickle', 'wb') as output:
    pickle.dump(features_test, output)

# labels_test
with open('Pickles/labels_test.pickle', 'wb') as output:
    pickle.dump(labels_test, output)
    
# TF-IDF object
with open('Pickles/tfidf.pickle', 'wb') as output:
    pickle.dump(tfidf, output)