## Exercises

The end result of this exercise should be a file named prepare.py that defines the requested functions.

In this exercise we will define some functions to prepare textual data. These functions should apply equally well to both the Codeup blog articles and the news articles that were previously acquired.

Define a function named basic_clean. It should take in a string and apply some basic text cleaning to it:

- Lowercase everything
- Normalize unicode characters
- Remove any character that is not a letter, number, whitespace, or single quote.
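
The three steps above can be sketched for a single string with only the standard library. This is a minimal illustration, not the required solution: the name `basic_clean_text` is hypothetical, and it assumes that "normalize" means decomposing with NFKD and then dropping whatever cannot be encoded as ASCII.

```python
import re
import unicodedata


def basic_clean_text(text):
    """Lowercase, normalize unicode to ASCII, and drop disallowed characters."""
    text = text.lower()
    # NFKD splits accented characters into a base character plus a combining
    # mark; encoding to ASCII with errors='ignore' then drops the marks.
    text = (
        unicodedata.normalize('NFKD', text)
        .encode('ascii', 'ignore')
        .decode('utf-8')
    )
    # Keep only lowercase letters, digits, single quotes, and whitespace.
    return re.sub(r"[^a-z0-9'\s]", '', text)


print(basic_clean_text("Héllo, World! It's 2022."))  # hello world it's 2022
```

Under this approach a curly apostrophe (’) is not ASCII, so it is dropped along with other non-ASCII characters: "we’ve" becomes "weve".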

Define a function named tokenize. It should take in a string and tokenize all the words in the string.

Define a function named stem. It should accept some text and return the text after applying stemming to all the words.

Define a function named lemmatize. It should accept some text and return the text after applying lemmatization to each word.

Define a function named remove_stopwords. It should accept some text and return the text after removing all the stopwords.

This function should define two optional parameters, extra_words and exclude_words. These parameters should define any additional stop words to include, and any words that we don't want to remove.
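
One way to think about the two optional parameters is as set arithmetic on the stop list: add the extras, then take out the exclusions. The sketch below assumes a tiny hard-coded `BASE_STOPWORDS` set as a stand-in for nltk's `stopwords.words('english')`, and the name `remove_stopwords_text` is hypothetical.

```python
# Tiny stand-in stop list (an assumption for illustration); the real
# exercise would use nltk's stopwords.words('english') instead.
BASE_STOPWORDS = {'a', 'an', 'and', 'is', 'the', 'to', 'we'}


def remove_stopwords_text(text, extra_words=None, exclude_words=None,
                          base_stopwords=BASE_STOPWORDS):
    """Return text with stopwords removed, honoring the two optional lists."""
    # Build the working set once: add the extras, then remove the exclusions.
    stop_set = set(base_stopwords) | set(extra_words or [])
    stop_set -= set(exclude_words or [])
    return ' '.join(word for word in text.split() if word not in stop_set)


sample = 'we are proud to share the report'
print(remove_stopwords_text(sample))                         # are proud share report
print(remove_stopwords_text(sample, extra_words=['proud']))  # are share report
print(remove_stopwords_text(sample, exclude_words=['we']))   # we are proud share report
```

Using `None` as the default avoids sharing one mutable list between calls, and a set keeps each membership check cheap.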

Use your data from the acquire exercise to produce a dataframe of the news articles. Name the dataframe news_df.

Make another dataframe for the Codeup blog posts. Name the dataframe codeup_df.

For each dataframe, produce the following columns:

- title to hold the title
- original to hold the original article/post content
- clean to hold the normalized and tokenized original with the stopwords removed.
- stemmed to hold the stemmed version of the cleaned data.
- lemmatized to hold the lemmatized version of the cleaned data.
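
The column layout above can be sketched as a small helper that works on a list of dictionaries, the same shape the scraping functions return. The helper name `prepare_records` and its function-valued parameters are hypothetical; the callables passed in the usage example are placeholders, not real cleaning, stemming, or lemmatizing functions.

```python
def prepare_records(records, text_key, clean_fn, stem_fn, lemmatize_fn):
    """Build one dict per article with the five requested columns."""
    prepared = []
    for record in records:
        original = record[text_key]
        clean = clean_fn(original)
        prepared.append({
            'title': record['title'],
            'original': original,
            'clean': clean,
            # Stem and lemmatize the cleaned text, not the raw text.
            'stemmed': stem_fn(clean),
            'lemmatized': lemmatize_fn(clean),
        })
    return prepared


# Placeholder callables, for illustration only.
rows = prepare_records(
    [{'title': 'Example', 'content': 'Some TEXT'}],
    text_key='content',
    clean_fn=str.lower,
    stem_fn=lambda text: text,
    lemmatize_fn=lambda text: text,
)
print(rows[0]['clean'])  # some text
```

The resulting list can be handed to `pd.DataFrame(rows)` to get the dataframe. Deriving `stemmed` and `lemmatized` from `clean` in a single pass avoids having to go back and overwrite those columns later.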

Ask yourself:

- If your corpus is 493KB, would you prefer to use stemmed or lemmatized text?
- If your corpus is 25MB, would you prefer to use stemmed or lemmatized text?
- If your corpus is 200TB of text and you're charged by the megabyte for your hosted computational resources, would you prefer to use stemmed or lemmatized text?
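
A rough way to put numbers on these questions is to time each approach on a small sample and scale up. The sketch below assumes processing time grows linearly with corpus size, which is a simplification; `estimate_seconds` is a hypothetical helper, and `str.upper` is only a placeholder for a stemming or lemmatizing function.

```python
import time


def estimate_seconds(func, sample_text, corpus_bytes):
    """Time func on a sample and extrapolate linearly to corpus_bytes."""
    sample_bytes = len(sample_text.encode('utf-8'))
    start = time.perf_counter()
    func(sample_text)
    elapsed = time.perf_counter() - start
    # Linear extrapolation: seconds per byte times total bytes.
    return elapsed / sample_bytes * corpus_bytes


sample = 'codeup is excited to launch our first report ' * 1000
# str.upper is only a placeholder for a stemmer or lemmatizer.
print(estimate_seconds(str.upper, sample, corpus_bytes=25 * 1024 ** 2))
```

Running the same estimate for a stemmer and a lemmatizer on the same sample gives a like-for-like comparison at each corpus size.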

In [22]:
import pandas as pd
import numpy as np
import re
from requests import get
from bs4 import BeautifulSoup as soupify
import unicodedata
import nltk
from nltk.corpus import stopwords

In [165]:
def basic_clean(string):
    '''This function takes in a pandas Series of strings and lowercases everything, normalizes unicode characters,
        and removes anything that is not a letter, number, whitespace, or single quote. It returns the cleaned
        strings as a list.'''
    cleaned = []
    string = string.str.lower()
    for item in string:
        # decompose accented characters, then drop anything that is not ASCII
        item = unicodedata.normalize('NFKD', item).encode('ascii', 'ignore').decode('utf-8')
        item = re.sub(r"[^0-9a-z'\s]", '', item)
        cleaned.append(item)
    return cleaned

In [161]:
df = pd.read_csv('news_inshorts.csv', index_col=[0])

In [162]:
df = df.reset_index().drop(columns='index')

In [163]:
df.title[1]

'Mohammad Shami confirms he is going to Australia ahead of T20 WC, shares pics from flight'

In [167]:
df['cleaned'] = basic_clean(df.title)

In [168]:
df

Unnamed: 0,title,author,date,content,category,cleaned
0,Ben Stokes' flying effort at boundary rope to ...,Anmol Sharma,10:52 pm on 12 Oct 2022,A video has gone viral showing England all-rou...,sports,ben stokes' flying effort at boundary rope to ...
1,Mohammad Shami confirms he is going to Austral...,Anmol Sharma,10:12 pm on 12 Oct 2022,Pacer Mohammad Shami on Wednesday confirmed th...,sports,mohammad shami confirms he is going to austral...
2,India's national record-holding discus thrower...,Anmol Sharma,10:48 pm on 12 Oct 2022,"Kamalpreet Kaur, who holds national record in ...",sports,india's national recordholding discus thrower ...
3,ICC names 2 strike bowlers for every team at T...,Anmol Sharma,01:53 pm on 12 Oct 2022,ICC has named two strike bowlers for every tea...,sports,icc names 2 strike bowlers for every team at t...
4,Who are the top 5 batters as per latest T20I r...,Anmol Sharma,11:13 pm on 12 Oct 2022,New Zealand wicketkeeper-batter Devon Conway l...,sports,who are the top 5 batters as per latest t20i r...
...,...,...,...,...,...,...
79,"Indian, US oil and gas companies sign 4 MoUs f...",Ambarish Awale,08:20 pm on 12 Oct 2022,Indian and American gas and oil companies sign...,business,indian us oil and gas companies sign 4 mous fo...
80,OPEC cuts 2022 oil demand growth for 4th time ...,Dharini Mudgal,08:50 pm on 12 Oct 2022,Organization of the Petroleum Exporting Countr...,business,opec cuts 2022 oil demand growth for 4th time ...
81,Adda247 raises $35 million led by WestBridge C...,Ambarish Awale,10:09 pm on 12 Oct 2022,Ed-tech platform ﻿Adda247 announced raising $3...,business,adda247 raises 35 million led by westbridge ca...
82,India's projected debt ratio to be 84% of GDP:...,Ambarish Awale,10:26 pm on 12 Oct 2022,"Paolo Mauro, Deputy Director, Fiscal Affairs D...",business,india's projected debt ratio to be 84 of gdp imf


In [24]:
url = 'https://codeup.com/blog/'
header = {'User-Agent': 'Codeup Data Science'}

In [25]:
get(url, headers=header)

<Response [200]>

In [26]:
soup = soupify(get(url, headers=header).content)

In [27]:
soup.select('a.more-link')[0]['href']

'https://codeup.com/codeup-news/dei-report/'

In [28]:
blog_posts = [link['href'] for link in soup.select('a.more-link')]

In [29]:
def get_blog_urls(base_url, header={'User-Agent': 'Codeup Data Science'}):
    # request the page passed in as base_url rather than the global url
    soup = soupify(get(base_url, headers=header).content)
    return [link['href'] for link in soup.select('a.more-link')]

In [30]:
get_blog_urls(url)

['https://codeup.com/codeup-news/dei-report/',
 'https://codeup.com/codeup-news/diversity-and-inclusion-award/',
 'https://codeup.com/featured/financing-career-transition/',
 'https://codeup.com/tips-for-prospective-students/tips-for-women/',
 'https://codeup.com/cloud-administration/cloud-computing-and-aws/',
 'https://codeup.com/codeup-news/c-suite-award-stephen-noteboom/']

In [31]:
article_soup = soupify(get(
    'https://codeup.com/codeup-news/dei-report/',
    headers=header
).content)

In [32]:
article_soup.select_one('h1.entry-title').text

'Diversity Equity and Inclusion Report'

In [33]:
article_soup.select_one('div.entry-content').text.strip()

'Codeup is excited to launch our first Diversity Equity, and Inclusion (DEI) report! In over eight years as an organization, we’ve implemented policies and grown our DEI efforts. We are extremely proud of the progress we’ve made as a staff and Codeup community, and we recognize there is more to learn. This report captures some of the ways that we’ve lived our value of Cultivating Inclusive Growth, and how we will continue doing so as we look to the future.\nWe wanted to shine a light on the demographics of our students and staff, and in particular how that compares to the tech industry as a whole. How we collect, organize, and share employee demographic data is informed by standards set by the Equal Employment Opportunity Commission (EEOC).\nWe are proud to celebrate how we’ve grown and are motivated and committed to do more and be better. To view the report visit the link here, or download it below.'

In [34]:
def get_blog_content(base_url):
    blog_links = get_blog_urls(base_url)
    all_blogs = []
    for blog in blog_links:
        blog_soup = soupify(
            get(blog,
                headers=header).content)
        blog_content = {'title': blog_soup.select_one(
            'h1.entry-title').text,
        'content': blog_soup.select_one(
            'div.entry-content').text.strip()}
        all_blogs.append(blog_content)
    return all_blogs

In [76]:
codeup_df = pd.DataFrame(get_blog_content(url))

In [77]:
codeup_df

Unnamed: 0,title,content
0,Diversity Equity and Inclusion Report,Codeup is excited to launch our first Diversit...
1,Codeup Honored as SABJ Diversity and Inclusion...,Codeup has been named the 2022 Diversity and I...
2,How Can I Finance My Career Transition?,Deciding to transition into a tech career is a...
3,Tips for Women Beginning a Career in Tech,"Codeup strongly values diversity, and inclusio..."
4,What is Cloud Computing and AWS?,With many companies switching to cloud service...
5,2022 SABJ C-Suite Award Winner: Stephen Noteboom,"Codeup’s Chief Operating Officer, Stephen Note..."


In [169]:
article = basic_clean(codeup_df.content)
article

['codeup is excited to launch our first diversity equity and inclusion dei report in over eight years as an organization weve implemented policies and grown our dei efforts we are extremely proud of the progress weve made as a staff and codeup community and we recognize there is more to learn this report captures some of the ways that weve lived our value of cultivating inclusive growth and how we will continue doing so as we look to the future\nwe wanted to shine a light on the demographics of our students and staff and in particular how that compares to the tech industry as a whole how we collect organize and share employee demographic data is informed by standards set by the equal employment opportunity commission eeoc\nwe are proud to celebrate how weve grown and are motivated and committed to do more and be better to view the report visit the link here or download it below',
 'codeup has been named the 2022 diversity and inclusion award winner from the san antonio business journal

In [170]:
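def tokenize(string):
    '''This function takes in a list (or dataframe column) of strings and tokenizes each one with the
        ToktokTokenizer. It returns a list of the tokenized strings.'''
    token = []
    tokenizer = nltk.tokenize.ToktokTokenizer()
    for item in string:
        item = tokenizer.tokenize(item, return_str=True)
        token.append(item)
    return token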

In [171]:
article = tokenize(article)
article

['codeup is excited to launch our first diversity equity and inclusion dei report in over eight years as an organization weve implemented policies and grown our dei efforts we are extremely proud of the progress weve made as a staff and codeup community and we recognize there is more to learn this report captures some of the ways that weve lived our value of cultivating inclusive growth and how we will continue doing so as we look to the future\nwe wanted to shine a light on the demographics of our students and staff and in particular how that compares to the tech industry as a whole how we collect organize and share employee demographic data is informed by standards set by the equal employment opportunity commission eeoc\nwe are proud to celebrate how weve grown and are motivated and committed to do more and be better to view the report visit the link here or download it below',
 'codeup has been named the 2022 diversity and inclusion award winner from the san antonio business journal

In [214]:
codeup_df['clean'] = article

In [215]:
codeup_df.head()

Unnamed: 0,title,content,clean
0,Diversity Equity and Inclusion Report,Codeup is excited to launch our first Diversit...,codeup is excited to launch our first diversit...
1,Codeup Honored as SABJ Diversity and Inclusion...,Codeup has been named the 2022 Diversity and I...,codeup has been named the 2022 diversity and i...
2,How Can I Finance My Career Transition?,Deciding to transition into a tech career is a...,deciding to transition into a tech career is a...
3,Tips for Women Beginning a Career in Tech,"Codeup strongly values diversity, and inclusio...",codeup strongly values diversity and inclusion...
4,What is Cloud Computing and AWS?,With many companies switching to cloud service...,with many companies switching to cloud service...


In [174]:
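def stem(string):
    '''This function takes in a list (or dataframe column) of strings and applies the Porter stemmer to every
        word. It returns a list of the stemmed strings.'''
    ps = nltk.porter.PorterStemmer()
    stemmed = []
    for item in string:
        stems = [ps.stem(word) for word in item.split()]
        article_stemmed = ' '.join(stems)
        stemmed.append(article_stemmed)
    return stemmed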

In [216]:
codeup_stem = stem(article)
codeup_stem

['codeup is excit to launch our first divers equiti and inclus dei report in over eight year as an organ weve implement polici and grown our dei effort we are extrem proud of the progress weve made as a staff and codeup commun and we recogn there is more to learn thi report captur some of the way that weve live our valu of cultiv inclus growth and how we will continu do so as we look to the futur we want to shine a light on the demograph of our student and staff and in particular how that compar to the tech industri as a whole how we collect organ and share employe demograph data is inform by standard set by the equal employ opportun commiss eeoc we are proud to celebr how weve grown and are motiv and commit to do more and be better to view the report visit the link here or download it below',
 'codeup ha been name the 2022 divers and inclus award winner from the san antonio busi journal we are thrill to be among those that take pride in ensur that divers equiti and inclus dei are prio

In [217]:
codeup_df['stemmed'] = codeup_stem

In [218]:
codeup_df.head()

Unnamed: 0,title,content,clean,stemmed
0,Diversity Equity and Inclusion Report,Codeup is excited to launch our first Diversit...,codeup is excited to launch our first diversit...,codeup is excit to launch our first divers equ...
1,Codeup Honored as SABJ Diversity and Inclusion...,Codeup has been named the 2022 Diversity and I...,codeup has been named the 2022 diversity and i...,codeup ha been name the 2022 divers and inclus...
2,How Can I Finance My Career Transition?,Deciding to transition into a tech career is a...,deciding to transition into a tech career is a...,decid to transit into a tech career is a big s...
3,Tips for Women Beginning a Career in Tech,"Codeup strongly values diversity, and inclusio...",codeup strongly values diversity and inclusion...,codeup strongli valu divers and inclus in hono...
4,What is Cloud Computing and AWS?,With many companies switching to cloud service...,with many companies switching to cloud service...,with mani compani switch to cloud servic and i...


In [178]:
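def lemmatize(string):
    '''This function takes in a list (or dataframe column) of strings and applies the WordNet lemmatizer to
        every word. It returns a list of the lemmatized strings.'''
    wnl = nltk.stem.WordNetLemmatizer()
    lemlem = []
    for item in string:
        lems = [wnl.lemmatize(word) for word in item.split()]
        article_lemmatized = ' '.join(lems)
        lemlem.append(article_lemmatized)
    return lemlem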

In [219]:
codeup_lemmed = lemmatize(article)
codeup_lemmed

['codeup is excited to launch our first diversity equity and inclusion dei report in over eight year a an organization weve implemented policy and grown our dei effort we are extremely proud of the progress weve made a a staff and codeup community and we recognize there is more to learn this report capture some of the way that weve lived our value of cultivating inclusive growth and how we will continue doing so a we look to the future we wanted to shine a light on the demographic of our student and staff and in particular how that compare to the tech industry a a whole how we collect organize and share employee demographic data is informed by standard set by the equal employment opportunity commission eeoc we are proud to celebrate how weve grown and are motivated and committed to do more and be better to view the report visit the link here or download it below',
 'codeup ha been named the 2022 diversity and inclusion award winner from the san antonio business journal we are thrilled 

In [220]:
codeup_df['lemmatized'] = codeup_lemmed

In [221]:
codeup_df.head()

Unnamed: 0,title,content,clean,stemmed,lemmatized
0,Diversity Equity and Inclusion Report,Codeup is excited to launch our first Diversit...,codeup is excited to launch our first diversit...,codeup is excit to launch our first divers equ...,codeup is excited to launch our first diversit...
1,Codeup Honored as SABJ Diversity and Inclusion...,Codeup has been named the 2022 Diversity and I...,codeup has been named the 2022 diversity and i...,codeup ha been name the 2022 divers and inclus...,codeup ha been named the 2022 diversity and in...
2,How Can I Finance My Career Transition?,Deciding to transition into a tech career is a...,deciding to transition into a tech career is a...,decid to transit into a tech career is a big s...,deciding to transition into a tech career is a...
3,Tips for Women Beginning a Career in Tech,"Codeup strongly values diversity, and inclusio...",codeup strongly values diversity and inclusion...,codeup strongli valu divers and inclus in hono...,codeup strongly value diversity and inclusion ...
4,What is Cloud Computing and AWS?,With many companies switching to cloud service...,with many companies switching to cloud service...,with mani compani switch to cloud servic and i...,with many company switching to cloud service a...


In [183]:
def remove_stopwords(string, extra_words=[], exclude_words=[]):
    '''This function takes in a dataframe column (or list) of strings and optional lists of words to add to or
        drop from the stopword list. It returns a list with the stopwords removed from each observation.'''
    # build the stopword list once, before looping over the observations
    stop_list = stopwords.words('english')
    for word in extra_words:
        if word not in stop_list:
            stop_list.append(word)
    for word in exclude_words:
        if word in stop_list:
            stop_list.remove(word)
    stopped = []
    for item in string:
        words = item.split()
        words_stopped = [word for word in words if word not in stop_list]
        article_stopped = ' '.join(words_stopped)
        stopped.append(article_stopped)
    return stopped

In [222]:
codeup_stopped = remove_stopwords(codeup_df.clean)
codeup_stopped

['codeup excited launch first diversity equity inclusion dei report eight years organization weve implemented policies grown dei efforts extremely proud progress weve made staff codeup community recognize learn report captures ways weve lived value cultivating inclusive growth continue look future wanted shine light demographics students staff particular compares tech industry whole collect organize share employee demographic data informed standards set equal employment opportunity commission eeoc proud celebrate weve grown motivated committed better view report visit link download',
 'codeup named 2022 diversity inclusion award winner san antonio business journal thrilled among take pride ensuring diversity equity inclusion dei priorities work environments learn efforts please check dei report 2022',
 'deciding transition tech career big step significant commitment often deciding commit journey nature main obstacle finding way finance training codeup recognize many students career tra

In [223]:
codeup_df['clean'] = codeup_stopped

In [224]:
codeup_df.head()

Unnamed: 0,title,content,clean,stemmed,lemmatized
0,Diversity Equity and Inclusion Report,Codeup is excited to launch our first Diversit...,codeup excited launch first diversity equity i...,codeup is excit to launch our first divers equ...,codeup is excited to launch our first diversit...
1,Codeup Honored as SABJ Diversity and Inclusion...,Codeup has been named the 2022 Diversity and I...,codeup named 2022 diversity inclusion award wi...,codeup ha been name the 2022 divers and inclus...,codeup ha been named the 2022 diversity and in...
2,How Can I Finance My Career Transition?,Deciding to transition into a tech career is a...,deciding transition tech career big step signi...,decid to transit into a tech career is a big s...,deciding to transition into a tech career is a...
3,Tips for Women Beginning a Career in Tech,"Codeup strongly values diversity, and inclusio...",codeup strongly values diversity inclusion hon...,codeup strongli valu divers and inclus in hono...,codeup strongly value diversity and inclusion ...
4,What is Cloud Computing and AWS?,With many companies switching to cloud service...,many companies switching cloud services implem...,with mani compani switch to cloud servic and i...,with many company switching to cloud service a...


In [226]:
stemmed_stopped = stem(codeup_df.clean)
stemmed_stopped

['codeup excit launch first divers equiti inclus dei report eight year organ weve implement polici grown dei effort extrem proud progress weve made staff codeup commun recogn learn report captur way weve live valu cultiv inclus growth continu look futur want shine light demograph student staff particular compar tech industri whole collect organ share employe demograph data inform standard set equal employ opportun commiss eeoc proud celebr weve grown motiv commit better view report visit link download',
 'codeup name 2022 divers inclus award winner san antonio busi journal thrill among take pride ensur divers equiti inclus dei prioriti work environ learn effort pleas check dei report 2022',
 'decid transit tech career big step signific commit often decid commit journey natur main obstacl find way financ train codeup recogn mani student career transition attend one program sometim requir sacrific luckili sever way help financ career transit ultim lead new career youv decid pursu program

In [227]:
# rebuilding the stemmed column from the cleaned, tokenized text with stopwords removed
codeup_df['stemmed'] = stemmed_stopped
codeup_df.head()

Unnamed: 0,title,content,clean,stemmed,lemmatized
0,Diversity Equity and Inclusion Report,Codeup is excited to launch our first Diversit...,codeup excited launch first diversity equity i...,codeup excit launch first divers equiti inclus...,codeup is excited to launch our first diversit...
1,Codeup Honored as SABJ Diversity and Inclusion...,Codeup has been named the 2022 Diversity and I...,codeup named 2022 diversity inclusion award wi...,codeup name 2022 divers inclus award winner sa...,codeup ha been named the 2022 diversity and in...
2,How Can I Finance My Career Transition?,Deciding to transition into a tech career is a...,deciding transition tech career big step signi...,decid transit tech career big step signific co...,deciding to transition into a tech career is a...
3,Tips for Women Beginning a Career in Tech,"Codeup strongly values diversity, and inclusio...",codeup strongly values diversity inclusion hon...,codeup strongli valu divers inclus honor ameri...,codeup strongly value diversity and inclusion ...
4,What is Cloud Computing and AWS?,With many companies switching to cloud service...,many companies switching cloud services implem...,mani compani switch cloud servic implement clo...,with many company switching to cloud service a...


In [228]:
lemmed_stopped = lemmatize(codeup_df.clean)
lemmed_stopped

['codeup excited launch first diversity equity inclusion dei report eight year organization weve implemented policy grown dei effort extremely proud progress weve made staff codeup community recognize learn report capture way weve lived value cultivating inclusive growth continue look future wanted shine light demographic student staff particular compare tech industry whole collect organize share employee demographic data informed standard set equal employment opportunity commission eeoc proud celebrate weve grown motivated committed better view report visit link download',
 'codeup named 2022 diversity inclusion award winner san antonio business journal thrilled among take pride ensuring diversity equity inclusion dei priority work environment learn effort please check dei report 2022',
 'deciding transition tech career big step significant commitment often deciding commit journey nature main obstacle finding way finance training codeup recognize many student career transitioners atte

In [229]:
# fixing lemmatized column (it did not have stopwords removed)
codeup_df['lemmatized'] = lemmed_stopped
codeup_df.head()

Unnamed: 0,title,content,clean,stemmed,lemmatized
0,Diversity Equity and Inclusion Report,Codeup is excited to launch our first Diversit...,codeup excited launch first diversity equity i...,codeup excit launch first divers equiti inclus...,codeup excited launch first diversity equity i...
1,Codeup Honored as SABJ Diversity and Inclusion...,Codeup has been named the 2022 Diversity and I...,codeup named 2022 diversity inclusion award wi...,codeup name 2022 divers inclus award winner sa...,codeup named 2022 diversity inclusion award wi...
2,How Can I Finance My Career Transition?,Deciding to transition into a tech career is a...,deciding transition tech career big step signi...,decid transit tech career big step signific co...,deciding transition tech career big step signi...
3,Tips for Women Beginning a Career in Tech,"Codeup strongly values diversity, and inclusio...",codeup strongly values diversity inclusion hon...,codeup strongli valu divers inclus honor ameri...,codeup strongly value diversity inclusion hono...
4,What is Cloud Computing and AWS?,With many companies switching to cloud service...,many companies switching cloud services implem...,mani compani switch cloud servic implement clo...,many company switching cloud service implement...


In [225]:
# test optional parameters
remove_stopwords(article, extra_words=['weve'])

['codeup excited launch first diversity equity inclusion dei report eight years organization implemented policies grown dei efforts extremely proud progress made staff codeup community recognize learn report captures ways lived value cultivating inclusive growth continue look future wanted shine light demographics students staff particular compares tech industry whole collect organize share employee demographic data informed standards set equal employment opportunity commission eeoc proud celebrate grown motivated committed better view report visit link download',
 'codeup named 2022 diversity inclusion award winner san antonio business journal thrilled among take pride ensuring diversity equity inclusion dei priorities work environments learn efforts please check dei report 2022',
 'deciding transition tech career big step significant commitment often deciding commit journey nature main obstacle finding way finance training codeup recognize many students career transitioners attending

In [134]:
url = 'https://inshorts.com/en/read' 

In [135]:
def get_cats(base_url):
    '''This function takes in the url from the inshorts.com base page and acquires a list of categories using 
        BeautifulSoup. It returns the list of categories in all lowercase letters.'''
    soup = soupify(get(base_url).content)
    return [cat.text.lower() for cat in soup.find_all('li')[1:]]

In [138]:
def get_all_shorts(base_url):
    '''This function takes in a url and gets the list of categories using the get_cats function. It acquires the
        text for all articles in each category using BeautifulSoup and returns a list of dictionaries, each
        holding the title, category, and text body of one article.'''
    cats = get_cats(base_url)
    all_articles = []
    for cat in cats:
        cat_url = base_url + '/' + cat
        cat_soup = soupify(get(cat_url).content)
        cat_titles = [
            title.text for title in cat_soup.find_all('span', itemprop='headline')
        ]
        cat_bodies = [
            body.text for body in cat_soup.find_all('div', itemprop='articleBody')]
        cat_articles = [{'title': title,
        'category': cat,
        'body': body} for title, body in zip(
        cat_titles, cat_bodies)]
        all_articles.extend(cat_articles)
    return all_articles

In [203]:
all_articles = get_all_shorts(url)

In [204]:
all_articles

[{'title': "Kashmir's famous Dal Lake freezes",
  'category': 'india',
  'body': "After the recent snowfall in upper reaches of Kashmir and Himalayan peak, Srinagar witnessed season's coldest night with temperature dipping to -3.7 degree Celsius. According to Meteorological Department(MeT), continued cold conditions have resulted in the freezing of water bodies including the Dal Lake."},
 {'title': 'Nigerian weightlifter in dope net, India may gain',
  'category': 'india',
  'body': 'India may move up after Nigerian weightlifter Chika Amalaha’s A dope sample returned positive. Santoshi Matsa’s bronze and Swati Singh\'s 4th position may be upgraded in women’s 53kg. “We shouldn’t presume anything just yet, once she has provided a B sample that is when we can decide on the re—awarding of medals and appropriate actions" said CWG-Federation chief-executive Mike Hooper.'},
 {'title': 'Infosys Gifts Sikka Shares Worth Rs 8.2cr',
  'category': 'india',
  'body': 'In a regulatory filing to the 

In [205]:
all_articles = pd.DataFrame(all_articles)

In [206]:
all_articles.head()

Unnamed: 0,title,category,body
0,Kashmir's famous Dal Lake freezes,india,After the recent snowfall in upper reaches of ...
1,"Nigerian weightlifter in dope net, India may gain",india,India may move up after Nigerian weightlifter ...
2,Infosys Gifts Sikka Shares Worth Rs 8.2cr,india,"In a regulatory filing to the BSE on Friday, I..."
3,Oldest woman in India passes away,india,"Kunjannam, a 112-yr-old woman from Parannur (K..."
4,"Afghanistan wins SAFF title, spoils India's ha...",india,Afghanistan won their maiden-SAFF Football Cha...


In [196]:
cleaned_articles = basic_clean(all_articles.body)
cleaned_articles

['kunjannam a 112yrold woman from parannur kerala who was declared the oldest living woman in india by limca book of records passed away tuesday morning she had been unwell for some days and had been unable to eat anything since monday in an interview kunjannam had revealed that vegetarian food and regular walks were what kept her healthy',
 "afghanistan won their maidensaff football championship after defeating india 20 in the finals at kathmandu and spoilt india's bid to win a hattrick of safftrophies mustafa azadzoy8thminute and sanjar ahmadi63rdminute scored goals for afghans who lost championship to india 04 in new delhi twoyears back from indianside robin singh and jeje lalpeklhua were praiseworthy but couldn't save india from defeat",
 'the indian navy has a new communication system critical for passing coded orders to nuclear submarines commissioned at ins kattabommantirunelveli tamil nadu the new facility will boost our ability to communicate with submarines which have trailin

In [197]:
cleaned_articles = tokenize(cleaned_articles)

In [207]:
all_articles['cleaned_body'] = cleaned_articles
all_articles.head()

Unnamed: 0,title,category,body,cleaned_body
0,Kashmir's famous Dal Lake freezes,india,After the recent snowfall in upper reaches of ...,kunjannam a 112yrold woman from parannur keral...
1,"Nigerian weightlifter in dope net, India may gain",india,India may move up after Nigerian weightlifter ...,afghanistan won their maidensaff football cham...
2,Infosys Gifts Sikka Shares Worth Rs 8.2cr,india,"In a regulatory filing to the BSE on Friday, I...",the indian navy ha a new communication system ...
3,Oldest woman in India passes away,india,"Kunjannam, a 112-yr-old woman from Parannur (K...",in the cwg men ' s hockey semifinal against ne...
4,"Afghanistan wins SAFF title, spoils India's ha...",india,Afghanistan won their maiden-SAFF Football Cha...,the billiards and snooker association of mahar...


In [208]:
lemmatized_articles = lemmatize(cleaned_articles)

In [209]:
all_articles['lemmatized'] = lemmatized_articles
all_articles.head()

Unnamed: 0,title,category,body,cleaned_body,lemmatized
0,Kashmir's famous Dal Lake freezes,india,After the recent snowfall in upper reaches of ...,kunjannam a 112yrold woman from parannur keral...,kunjannam a 112yrold woman from parannur keral...
1,"Nigerian weightlifter in dope net, India may gain",india,India may move up after Nigerian weightlifter ...,afghanistan won their maidensaff football cham...,afghanistan won their maidensaff football cham...
2,Infosys Gifts Sikka Shares Worth Rs 8.2cr,india,"In a regulatory filing to the BSE on Friday, I...",the indian navy ha a new communication system ...,the indian navy ha a new communication system ...
3,Oldest woman in India passes away,india,"Kunjannam, a 112-yr-old woman from Parannur (K...",in the cwg men ' s hockey semifinal against ne...,in the cwg men ' s hockey semifinal against ne...
4,"Afghanistan wins SAFF title, spoils India's ha...",india,Afghanistan won their maiden-SAFF Football Cha...,the billiards and snooker association of mahar...,the billiards and snooker association of mahar...


In [210]:
stemmed_articles = stem(cleaned_articles)

In [211]:
all_articles['stemmed'] = stemmed_articles
all_articles.head()

Unnamed: 0,title,category,body,cleaned_body,lemmatized,stemmed
0,Kashmir's famous Dal Lake freezes,india,After the recent snowfall in upper reaches of ...,kunjannam a 112yrold woman from parannur keral...,kunjannam a 112yrold woman from parannur keral...,kunjannam a 112yrold woman from parannur keral...
1,"Nigerian weightlifter in dope net, India may gain",india,India may move up after Nigerian weightlifter ...,afghanistan won their maidensaff football cham...,afghanistan won their maidensaff football cham...,afghanistan won their maidensaff footbal champ...
2,Infosys Gifts Sikka Shares Worth Rs 8.2cr,india,"In a regulatory filing to the BSE on Friday, I...",the indian navy ha a new communication system ...,the indian navy ha a new communication system ...,the indian navi ha a new commun system critic ...
3,Oldest woman in India passes away,india,"Kunjannam, a 112-yr-old woman from Parannur (K...",in the cwg men ' s hockey semifinal against ne...,in the cwg men ' s hockey semifinal against ne...,in the cwg men ' s hockey semifin against new ...
4,"Afghanistan wins SAFF title, spoils India's ha...",india,Afghanistan won their maiden-SAFF Football Cha...,the billiards and snooker association of mahar...,the billiards and snooker association of mahar...,the billiard and snooker associ of maharashtra...


In [212]:
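# This binds a second name to the same DataFrame; use all_articles.copy()
# if an independent copy is needed.
news_df = all_articles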

In [213]:
news_df

Unnamed: 0,title,category,body,cleaned_body,lemmatized,stemmed
0,Kashmir's famous Dal Lake freezes,india,After the recent snowfall in upper reaches of ...,kunjannam a 112yrold woman from parannur keral...,kunjannam a 112yrold woman from parannur keral...,kunjannam a 112yrold woman from parannur keral...
1,"Nigerian weightlifter in dope net, India may gain",india,India may move up after Nigerian weightlifter ...,afghanistan won their maidensaff football cham...,afghanistan won their maidensaff football cham...,afghanistan won their maidensaff footbal champ...
2,Infosys Gifts Sikka Shares Worth Rs 8.2cr,india,"In a regulatory filing to the BSE on Friday, I...",the indian navy ha a new communication system ...,the indian navy ha a new communication system ...,the indian navi ha a new commun system critic ...
3,Oldest woman in India passes away,india,"Kunjannam, a 112-yr-old woman from Parannur (K...",in the cwg men ' s hockey semifinal against ne...,in the cwg men ' s hockey semifinal against ne...,in the cwg men ' s hockey semifin against new ...
4,"Afghanistan wins SAFF title, spoils India's ha...",india,Afghanistan won their maiden-SAFF Football Cha...,the billiards and snooker association of mahar...,the billiards and snooker association of mahar...,the billiard and snooker associ of maharashtra...
...,...,...,...,...,...,...
280,Record 5.4 lakh vehicles sold during Navratri ...,automobile,Federation of Automobile Dealers Associations ...,a report by china passenger car association ha...,a report by china passenger car association ha...,a report by china passeng car associ ha reveal...
281,Vintage cars on display to promote wildlife pr...,automobile,"To create awareness about wildlife week, the K...",international road federation irf ha urged roa...,international road federation irf ha urged roa...,intern road feder irf ha urg road transport mi...
282,UP govt announces new EV policy to attract ₹30...,automobile,The Uttar Pradesh government announced a new e...,to create awareness about wildlife week the ka...,to create awareness about wildlife week the ka...,to creat awar about wildlif week the karnataka...
283,Withdraw rule that makes 6 airbags mandatory i...,automobile,International Road Federation (IRF) has urged ...,passenger vehicle wholesale in india surged by...,passenger vehicle wholesale in india surged by...,passeng vehicl wholesal in india surg by 92 to...
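
In the output above, the `cleaned_body` text does not always appear to belong to the same row as `body`: the body on row 3 begins "Kunjannam, a 112-yr-old woman", while that text shows up in `cleaned_body` on row 0. One way to catch this is to clean the start of each body and compare it with the start of its derived column. This is a rough heuristic sketch; `misaligned_rows`, `simple_clean` and the 10-character prefix are assumptions made for illustration:

```python
import re


def simple_clean(text):
    """Lowercase and keep only letters, digits, whitespace and single quotes."""
    return re.sub(r"[^a-z0-9'\s]", "", text.lower())


def misaligned_rows(bodies, cleaned, prefix_len=10):
    """Return positions where the cleaned text does not start like its body."""
    mismatches = []
    for position, (body, clean) in enumerate(zip(bodies, cleaned)):
        if simple_clean(body)[:prefix_len] != clean[:prefix_len]:
            mismatches.append(position)
    return mismatches


bodies = [
    "After the recent snowfall in upper reaches",
    "Kunjannam, a 112-yr-old woman from Parannur",
]
cleaned = [
    "kunjannam a 112yrold woman from parannur",
    "kunjannam a 112yrold woman from parannur",
]
print(misaligned_rows(bodies, cleaned))  # → [0]
```

In the notebook this could be called as `misaligned_rows(all_articles['body'], all_articles['cleaned_body'])`, since zipping two pandas Series yields their values in row order. An empty result would suggest the columns line up. Deriving each column with `all_articles['body'].apply(...)` sidesteps the issue, because every value is then computed from its own row.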
