### Data Preparation Exercises
The end result of this exercise should be a file named `prepare.py` that defines the requested functions.

In this exercise we will define some functions to prepare textual data. These functions should apply equally well to both the Codeup blog articles and the news articles that were previously acquired.

In [20]:
import unicodedata
import re
import json
import nltk
import pandas as pd
import acquire

from nltk.tokenize.toktok import ToktokTokenizer
from nltk.corpus import stopwords

from time import strftime

1. Define a function named `basic_clean`. It should take in a string and apply some basic text cleaning to it:

 - Lowercase everything
 - Normalize unicode characters
 - Replace anything that is not a letter, number, whitespace or a single quote with an empty string.
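
The three steps above can be traced on a short string before running them on a full article. This is only a sketch using the standard library; the `sample` sentence is made up for illustration and is not part of the acquired data.

```python
import re
import unicodedata

# Made-up sample with a curly apostrophe, an accented letter, a straight
# apostrophe, punctuation, and curly double quotes.
sample = "Maine’s café won't reopen: 250,000 outages, “large, prolonged”!"

# Step 1: lowercase everything.
lowered = sample.lower()

# Step 2: NFKD splits accented characters into a base letter plus a combining
# mark; encoding to ASCII with errors='ignore' then drops every non-ASCII
# character, including curly quotes and curly apostrophes.
normalized = (
    unicodedata.normalize("NFKD", lowered)
    .encode("ascii", "ignore")
    .decode("utf-8", "ignore")
)

# Step 3: remove anything that is not a letter, digit, single quote, or whitespace.
cleaned = re.sub(r"[^a-z0-9'\s]", "", normalized)
print(cleaned)
```

Note that only the straight apostrophe in `won't` survives. The curly apostrophe in `Maine’s` is not ASCII, so it is dropped during normalization and the word becomes `maines`.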

In [2]:
text = """
Over 250,000 without power as major winter storm slogs east
A serious ice storm is hitting from eastern Texas to southwestern Ohio, with heavy snow to the north


Ice coats tree branches in Dallas on Feb. 3. (LM Otero/AP)
More than 100 million Americans from the Permian Basin of West Texas to Maine’s northern border are included in winter weather advisories or winter storm and ice storm warnings, a swath of weather alerts about 2,000 miles long.

In Tennessee, the number of power outages had spiked to 140,000 Thursday afternoon, many of them near Memphis. Outage numbers were also climbing in Ohio (26,000) and Kentucky (15,000), according to PowerOutage.us, which tracks outages nationwide. The outages were mostly because of ice accumulation.

Outage numbers in Texas and Arkansas, which peaked Thursday morning over 70,000 and 25,000, respectively, were slowly declining in the afternoon as freezing rain transitioned more to sleet and snow, which do not accumulate as readily on trees and powerlines. The outage count in Texas had dropped below 50,000.

Air travel nationwide was plagued due to the sprawling storm. There were more than 5,500 cancellations and 3,300 delays, according to the tracking website FlightAware — with many concentrated in Texas.

More than 1,100 flights departing from or scheduled to arrive at Dallas Fort Worth International Airport were canceled. The airport tweeted earlier that runways were being treated for snow and ice. More than 400 combined flights were canceled at Austin-Bergstrom International Airport, and more than 350 were canceled at Dallas’s Love Field.

4,700 flights canceled Thursday as winter storms slam Midwest, southern U.S.

Substantial accumulations of ice, enough to snap tree limbs and trigger power outages, were expected from roughly eastern Texas to southwestern Ohio. To the north, heavy snowfall, exceeding 6 inches, was projected in St. Louis, Indianapolis, Cleveland, Buffalo and Burlington, Vt., into Friday.

The National Weather Service warned of hazardous road conditions throughout the affected areas, which it said would probably see well-below-average temperatures for at least the next couple of days.

Parts of the Mississippi River valley and the Great Plains could record temperatures 20 to 40 degrees below average, the Weather Service said. It called the storm “large, prolonged, and significant.”

In the Deep South, the Weather Service projected elevated chances for severe thunderstorms, particularly in a zone from southeastern Mississippi to central and southern Alabama where a tornado watch was in effect until 6 p.m. Thursday.

The Weather Service in Birmingham warned of a “large and extremely dangerous tornado” near Sawyerville, Ala., about 30 miles southwest of Tuscaloosa, just after 3 p.m. Eastern time, where there were reports of damage on social media.

On Feb. 2, a major winter storm began dumping a wintry mess on more than 80 million Americans across the Midwest and South. (The Washington Post)
Freezing rain was ongoing in eastern Texas, eastern Arkansas, western Tennessee, northern Kentucky and southwest Ohio, including Memphis and Cincinnati.

Some of the worst conditions were in southwest Tennessee, with over 120,000 power outages reported in Shelby County, home to Memphis. As much as half an inch of ice had accumulated on tree limbs, which were crackling under its weight.

“Travel conditions are bad and getting worse” around Cincinnati, where there had been a glaze of ice, the Weather Service tweeted. The ice was anticipated to switch to snow in Cincinnati during the evening.

The Weather Service issued a special bulletin for the Indiana-Kentucky border area, southern Ohio and far western Pennsylvania, warning of “a long duration of sleet and freezing rain” continuing into Thursday evening.

In Dallas, freezing rain and sleet had changed to snow late morning, and a heavy burst passed through the city midday. "Road conditions are treacherous,” the Weather Service tweeted. Snow was beginning to taper off in Dallas late in the afternoon.

Farther north, steady snow was falling in an extensive zone from Little Rock to St. Louis to Cleveland to Buffalo to Burlington, Vt.

Arctic high pressure parked over the Dakotas is supplying frigid air for a front that stretches from southeast Louisiana to Nashville to Pittsburgh to northern Maine.

Moisture riding up and over the cold front from the south is falling into a subfreezing air mass, hardening into ice on contact and quickly accreting into a hazardous glaze.

Meanwhile, flooding rain has been falling on the system’s warm side in the Deep South, with flash flood watches up in northern Alabama, northeast Georgia, southeast Tennessee and southwest North Carolina.

Frozen precipitation should end in Texas on Thursday evening with even a little sleet possible in Houston before it shuts down.

The heaviest icing is expected in eastern Arkansas, western Tennessee, northern Kentucky and parts of southern Ohio Thursday but should gradually taper off Thursday night, possibly ending as brief period of snow.

The most substantial icy precipitation will shift from the Tennessee Valley into West Virginia and central Pennsylvania on Thursday night into Friday morning. Pittsburgh and State College, Pa., could see significant icing overnight before precipitation flips to snow. Areas northwest of New York City in the Hudson Valley could see icy conditions Friday morning as well.

The snow should finally end around Little Rock and St. Louis early Thursday evening. However, it will continue well into the overnight hours in southeast Missouri, southern Illinois, southern Indiana, Indiana and the northern third of Ohio.

Extreme northwest Pennsylvania, Upstate New York around the Finger Lakes and the Tug Hill Plateau and Vermont will see snowfall well into Friday. The snow will enter Maine and New Hampshire in northern New England on Thursday afternoon.

The heaviest snow totals will be realized just north of the freezing rain and sleet line, with another 6 to 12 inches possible, especially in areas that haven’t seen snow yet into the interior Northeast.

The storm’s final act comes Friday in eastern New England, when morning rain will transition to freezing rain, sleet and all snow as temperatures crash into the 20s during the evening. A “flash freeze” overnight into early Saturday will cause any remnant moisture and slushy slop on the roadways to harden and freeze.

The power situation in Texas


Ahead of the storm, Texas Gov. Greg Abbott (R) said Wednesday that “no one can guarantee” there would not be power outages during the first significant test of the state’s power grid a year after a historic freeze killed hundreds of residents and left millions without power for days.

The number of outages as of Thursday were “certainly an inconvenience but not that big of a deal in terms of what you might expect from a wind storm and ice storm,” said Michael Webber, an energy resources professor at the University of Texas at Austin.

Outages last year were driven largely by a “massive imbalance between supply and demand,” he said, but Thursday seemed driven by wind or ice accumulation that can fell trees.

The power at Webber’s home was out but started to return during an interview with The Washington Post.

“An outage for two hours — not two and a half days,” he said, underlining the difference from the major storm in 2021.

Texas governor says ‘no one can guarantee’ there won’t be power outages in winter storm

Daniel Cohan, associate professor of environmental engineering at Rice University in Houston, said Thursday’s outages had been localized, “which anywhere in the world could get when you have an ice storm come through and knock down tree branches and affect local power lines.”

Ahead of the winter storm, Texas’s state grid reported that its electric generation units and transmission facilities had met new standards Webber said such inspections and winterization investments will have already been helpful, “but the system hasn’t really been tested yet.”

“This storm will not be as cold for as long across as wide an area as last year,” he said. “So it isn’t the same kind of test. It’s a simpler, milder storm in many ways, and the equipment should be in better shape.”

In the coming days, Webber said, it will be helpful to monitor how cold it gets in Houston. The state’s biggest city will largely dictate the level of power demand. He said he is also interested in how gas systems and gas-producing regions hold up to the cold, and how that will affect gas supply.

Cohan said he’s watching how the wind farms fare, whether major gas pipelines remain in working order and whether “our old fleet of power plants perform better than they did last year.”

“If not too many wind farms ice, gas supply stays adequate and not too many power plants fail in the freeze, then we’ll get through this fine and eke by with just enough supply,” Cohan said.

Snow and ice totals so far


The snow had already piled up to impressive and disruptive levels through Thursday midday, with dozens of locations logging double-digit accumulations. Leading the pack was Colorado Springs, where 22 inches had piled up.

Farther east, Chicago was initially predicted to be right on the fringe of snowfall, and the atmosphere delivered — Chicago’s Midway airport tallied 11 inches of snow, while O’Hare International only wound up with 5.6, roughly half that.

The ice event underway lagged the snow by about 12 hours, meaning totals will increase as Thursday wears on. But some areas have already seen significant icing.

Up to a half-inch of ice was reported in eastern Oklahoma, near the border with Arkansas. Nearly one-quarter inch had accumulated in Fort Smith, Ark.

In Texas, the counties of Hunt, Fannin and Collin, north and east of Dallas, reported tree damage from overnight icing, while the Weather Service reported a glaze of around 0.1 inches near Wichita Falls.

Little Rock had seen more sleet than freezing rain, with the Weather Service reporting 2.5 inches in north Little Rock.

Timothy Bella contributed to this report.
"""

In [4]:
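def basic_clean(string):
    # Lowercase first so the regex below only needs to allow a-z
    article = string.lower()
    # Decompose accented characters, then drop anything that is not ASCII
    article = (
        unicodedata.normalize('NFKD', article)
        .encode('ascii', 'ignore')
        .decode('utf-8', 'ignore')
    )
    # Remove everything except letters, digits, single quotes, and whitespace
    article = re.sub(r"[^a-z0-9'\s]", '', article)

    return article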


In [5]:
text = basic_clean(text)
text



2. Define a function named `tokenize`. It should take in a string and tokenize all the words in the string.

In [6]:
def tokenize(string):
    tokenizer = ToktokTokenizer()
    # Return the tokenized string so it can be passed to the next step
    return tokenizer.tokenize(string, return_str=True)

In [8]:
print(tokenize(text))

over 250000 without power as major winter storm slogs east
a serious ice storm is hitting from eastern texas to southwestern ohio with heavy snow to the north


ice coats tree branches in dallas on feb 3 lm oteroap

in tennessee the number of power outages had spiked to 140000 thursday afternoon many of them near memphis outage numbers were also climbing in ohio 26000 and kentucky 15000 according to poweroutageus which tracks outages nationwide the outages were mostly because of ice accumulation

outage numbers in texas and arkansas which peaked thursday morning over 70000 and 25000 respectively were slowly declining in the afternoon as freezing rain transitioned more to sleet and snow which do not accumulate as readily on trees and powerlines the outage count in texas had dropped below 50000

air travel nationwide was plagued due to the sprawling storm there were more than 5500 cancellations and 3300 delays according to the tracking website flightaware with many concentrated in texas


3. Define a function named `stem`. It should accept some text and return the text after applying stemming to all the words.

In [12]:
def stem(string):
    ps = nltk.stem.PorterStemmer()
    stems = [ps.stem(word) for word in string.split()]
    # Return the stemmed text instead of printing it so callers can reuse it
    return ' '.join(stems)

In [13]:
print(stem(text))

over 250000 without power as major winter storm slog east a seriou ice storm is hit from eastern texa to southwestern ohio with heavi snow to the north ice coat tree branch in dalla on feb 3 lm oteroap more than 100 million american from the permian basin of west texa to main northern border are includ in winter weather advisori or winter storm and ice storm warn a swath of weather alert about 2000 mile long in tennesse the number of power outag had spike to 140000 thursday afternoon mani of them near memphi outag number were also climb in ohio 26000 and kentucki 15000 accord to poweroutageu which track outag nationwid the outag were mostli becaus of ice accumul outag number in texa and arkansa which peak thursday morn over 70000 and 25000 respect were slowli declin in the afternoon as freez rain transit more to sleet and snow which do not accumul as readili on tree and powerlin the outag count in texa had drop below 50000 air travel nationwid wa plagu due to the sprawl storm there wer

4. Define a function named `lemmatize`. It should accept some text and return the text after applying lemmatization to each word.

In [14]:
def lemmatize(string):
    wnl = nltk.stem.WordNetLemmatizer()
    lemmas = [wnl.lemmatize(word) for word in string.split()]
    # Return the lemmatized text instead of printing it so callers can reuse it
    return ' '.join(lemmas)

In [15]:
print(lemmatize(text))



5. Define a function named `remove_stopwords`. It should accept some text and return the text after removing all the stopwords.

    This function should define two optional parameters, `extra_words` and `exclude_words`. These parameters should define any additional stop words to include, and any words that we don't want to remove.
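
The effect of the two parameters can be sketched with plain sets. The mini stop list below is a hypothetical stand-in for `stopwords.words('english')`, used so the example runs without downloading the NLTK corpus; the variable names are for illustration only.

```python
# Hypothetical mini stop list standing in for stopwords.words('english').
base_stopwords = {"the", "is", "to", "a", "not", "of"}

extra_words = ["storm"]    # additional words to treat as stopwords
exclude_words = ["not"]    # stopwords that should be kept in the text

# Drop the excluded words from the stop list, then add the extra ones.
effective = (base_stopwords - set(exclude_words)) | set(extra_words)

sentence = "the storm is not expected to end"
kept = [word for word in sentence.split() if word not in effective]
print(" ".join(kept))
```

Here `not` stays in the output because it was excluded from the stop list, while `storm` is removed because it was added to it.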

In [16]:
def remove_stopwords(string, extra_words=None, exclude_words=None):
    # Start from NLTK's English list, drop the words to keep, add the extras
    stopword_list = set(stopwords.words('english'))
    stopword_list -= set(exclude_words or [])
    stopword_list |= set(extra_words or [])
    words = string.split()
    filtered_words = [word for word in words if word not in stopword_list]

    return ' '.join(filtered_words)

In [17]:
remove_stopwords(text)



6. Use your data from the acquire exercises to produce a dataframe of the news articles. Name the dataframe `news_df`.

In [21]:
news_df = acquire.get_inshorts_articles()
news_df





| Unnamed: 0 | title | author | content | date | category |
|---|---|---|---|---|---|
| 0 | RBI cancels licence of Maha-based Independence... | Shalini Ojha | RBI has cancelled licence of Maharashtra-based... | 03 Feb 2022,Thursday | business |
| 1 | Boost to EVs a big step: Windmill Capital | Roshan Gupta | Increased use of EVs in public transport, spec... | 03 Feb 2022,Thursday | business |
| 2 | Facebook parent Meta's $230-billion wipeout bi... | Pragya Swastik | Facebook's parent Meta's shares plunged 27% an... | 03 Feb 2022,Thursday | business |
| 3 | Bezos to dismantle 1927 Dutch bridge to let hi... | Kiran Khatri | The Netherlands' Rotterdam will reportedly dis... | 03 Feb 2022,Thursday | business |
| 4 | Facebook's daily active users fall for first t... | Pragya Swastik | Facebook has seen its daily active users (DAUs... | 03 Feb 2022,Thursday | business |
| ... | ... | ... | ... | ... | ... |
| 95 | Told Deepika won't click pic with you, I'll do... | Kriti Kambiri | Actor Dhairya Karwa revealed that he went to D... | 03 Feb 2022,Thursday | entertainment |
| 96 | Convinced people in bus I was IPS; they got me... | Ria Kapoor | Actress Gul Panag said she once convinced bus ... | 03 Feb 2022,Thursday | entertainment |
| 97 | Farhan, Shibani to host their wedding party at... | Mahima Kharbanda | Actor Farhan Akhtar and Shibani Dandekar will ... | 03 Feb 2022,Thursday | entertainment |
| 98 | Priyanka to star opposite Anthony Mackie in ac... | Kriti Kambiri | Actress Priyanka Chopra will be starring oppos... | 03 Feb 2022,Thursday | entertainment |


7. Make another dataframe for the Codeup blog posts. Name the dataframe `codeup_df`.

In [23]:
codeup_df = acquire.get_blog_articles()
codeup_df





| Unnamed: 0 | title | published | content |
|---|---|---|---|
| 0 | Codeup Dallas Open House | Nov 30, 2021 | Come join us for the re-opening of our Dallas ... |
| 1 | Codeup’s Placement Team Continues Setting Records | Nov 19, 2021 | Our Placement Team is simply defined as a grou... |
| 2 | IT Certifications 101: Why They Matter, and Wh... | Nov 18, 2021 | AWS, Google, Azure, Red Hat, CompTIA…these are... |
| 3 | A rise in cyber attacks means opportunities fo... | Nov 17, 2021 | In the last few months, the US has experienced... |
| 4 | Use your GI Bill® benefits to Land a Job in Tech | Nov 4, 2021 | As the end of military service gets closer, ma... |
| 5 | Which program is right for me: Cyber Security ... | Oct 28, 2021 | What IT Career should I choose?\nIf you’re thi... |
| 6 | What the Heck is System Engineering? | Oct 21, 2021 | Codeup offers a 13-week training program: Syst... |
| 7 | From Speech Pathology to Business Intelligence | Oct 18, 2021 | By: Alicia Gonzalez\nBefore Codeup, I was a ho... |
| 8 | Boris – Behind the Billboards | Oct 3, 2021 | |
| 9 | Is Codeup the Best Bootcamp in San Antonio…or ... | Sep 16, 2021 | Looking for the best data science bootcamp in ... |


8. For each dataframe, produce the following columns:

 - `title` to hold the title
 - `original` to hold the original article/post content
 - `clean` to hold the normalized and tokenized original with the stopwords removed.
 - `stemmed` to hold the stemmed version of the cleaned data.
 - `lemmatized` to hold the lemmatized version of the cleaned data.
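
One possible shape for this step is a single helper that is called once per dataframe. This is a sketch, not the expected solution: `prep_article_data` and its parameter names are hypothetical, and the three function arguments are assumed to be functions that take a string and return a string. The demo uses a made-up two-row frame and stand-in string methods so that it runs without NLTK.

```python
import pandas as pd


def prep_article_data(df, column, clean_fn, stem_fn, lemmatize_fn):
    """Build the title/original/clean/stemmed/lemmatized frame for one source."""
    out = df[["title", column]].rename(columns={column: "original"})
    # Missing content (NaN/None) is treated as an empty string so the text
    # functions do not fail; the original column itself is left untouched.
    out["clean"] = out["original"].fillna("").apply(clean_fn)
    out["stemmed"] = out["clean"].apply(stem_fn)
    out["lemmatized"] = out["clean"].apply(lemmatize_fn)
    return out


# Made-up frame and stand-in functions, used only to show the resulting shape.
demo_df = pd.DataFrame({"title": ["A", "B"], "content": ["Storms Hit Texas", None]})
demo = prep_article_data(demo_df, "content", str.lower, str.upper, str.title)
print(demo.columns.tolist())
```

In the notebook, `clean_fn` could be something like `lambda s: remove_stopwords(tokenize(basic_clean(s)))`, provided each of those functions returns its result rather than printing it. The `fillna` call matters for `codeup_df`, where at least one post has empty content.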

9. Ask yourself:

 - If your corpus is 493KB, would you prefer to use stemmed or lemmatized text?
 - If your corpus is 25MB, would you prefer to use stemmed or lemmatized text?
 - If your corpus is 200TB of text and you're charged by the megabyte for your hosted computational resources, would you prefer to use stemmed or lemmatized text?