<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 15px; height: 80px">

# Week 5 Review - Solutions

 _**Author:** Noelle B. (DSI-DEN)_

---
We will review the learning objectives of each lesson this week and answer questions related to them.

---
## 5.01 Intro to HTML

### Create an HTML document from scratch & Describe the most common HTML tags and their usage

**Q1.** What does this tag mean: `<p></p>`

> **Answer:**  
Paragraph tag, marks a block of text as a paragraph.

**Q2.** What does this tag mean: `<ul></ul>`

> **Answer:**  
Unordered list, creates a bulleted list.

**Q3.** What does this tag mean: `<th></th>`

> **Answer:**  
Table header cell, defines a header cell in a table. Header cells usually sit in the first row (`<tr>`) and label each column.

**Q4.** What does this tag mean: `<body></body>`

> **Answer:**  
Body tag, specifies the body of the webpage.
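
To see these tags in context, here is a minimal sketch that stores a small HTML document in a Python string and lists its opening tags with the standard library's `html.parser`. The page content (`Demo`, `Ada`, and so on) and the `TagCollector` class are made up for illustration.

```python
from html.parser import HTMLParser

# A small, hypothetical HTML document using the tags from Q1-Q4.
html_doc = """<!DOCTYPE html>
<html>
  <head><title>Demo</title></head>
  <body>
    <p>A paragraph.</p>
    <ul><li>First</li><li>Second</li></ul>
    <table>
      <tr><th>Name</th><th>Role</th></tr>
      <tr><td>Ada</td><td>Engineer</td></tr>
    </table>
  </body>
</html>"""


class TagCollector(HTMLParser):
    """Record the name of every opening tag, in document order."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


collector = TagCollector()
collector.feed(html_doc)
print(collector.tags)
```

Everything visible on the page lives inside `<body>`, and the two `<th>` cells label the columns of the table.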

### Explain how CSS can be used to modify the display of HTML

**Q5.** What does `CSS` stand for? What does `HTML` stand for?

> **Answer:**  
CSS: Cascading Style Sheets  
HTML: Hypertext Markup Language

**Q6.** What is the difference between CSS and HTML?

> **Answer:**  
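HTML defines the structure and content of a webpage (headings, paragraphs, tables, links), while CSS describes how those HTML elements are displayed (colors, fonts, spacing, layout).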

---
## 5.02 BeautifulSoup

### Define Webscraping

**Q7.** Define Webscraping.

> **Answer:**  
Webscraping is the process of extracting data from a website using code.

### Use the requests library

**Q8.** Use the requests library to create a request for the following URL. What status code do you get and what does this mean?

In [2]:
import requests
url = 'https://www.imdb.com/title/tt0241527/'

# Answer:
res = requests.get(url)
res.status_code

200

> **Answer:**  
A status code of 200 means the request was successful.
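
Status codes fall into ranges, and the range tells you the general outcome. The sketch below uses the standard library's `http.HTTPStatus` to label a few codes without making any network call; `describe_status` is a hypothetical helper written for this review.

```python
from http import HTTPStatus


def describe_status(code):
    """Return the code, its standard phrase, and the general outcome."""
    status = HTTPStatus(code)
    if 200 <= code < 300:
        outcome = "success"
    elif 300 <= code < 400:
        outcome = "redirection"
    elif 400 <= code < 500:
        outcome = "client error"
    elif 500 <= code < 600:
        outcome = "server error"
    else:
        outcome = "informational"
    return f"{code} {status.phrase}: {outcome}"


for code in (200, 301, 404, 500):
    print(describe_status(code))
```

When working with `requests`, calling `res.raise_for_status()` is a convenient way to stop early on a 4xx or 5xx response.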

### Create a Beautiful Soup object & find soup objects

**Q9.** Using the following request, create a beautiful soup object and find all table objects.

In [6]:
from bs4 import BeautifulSoup
import requests

url = 'https://www.imdb.com/title/tt0241527/'
res = requests.get(url)

# Answer:
soup = BeautifulSoup(res.content, 'lxml')
table = soup.find_all('table')

###  Create a pandas dataframe from a scrape

**Q10.** Using the following table scraped from IMDb, create a dataframe of actor name and character name.

In [10]:
from bs4 import BeautifulSoup
import requests
import pandas as pd

url = 'https://www.imdb.com/title/tt0241527/'
res = requests.get(url)

soup = BeautifulSoup(res.content, 'lxml')
table = soup.find('table', {'class': 'cast_list'})

In [17]:
# Answer:
cast_list = []

for row in table.find_all('tr')[1:]:
    cast = {}
    cast['name'] = row.find_all('a')[1].text.strip()
    cast['character'] = row.find('td', {'class': 'character'}).text.strip()
    
    cast_list.append(cast)
    
pd.DataFrame(cast_list)

| | name | character |
|---|---|---|
| 0 | Richard Harris | Albus Dumbledore |
| 1 | Maggie Smith | Professor McGonagall |
| 2 | Robbie Coltrane | Hagrid |
| 3 | Saunders Triplets | Baby Harry Potter |
| 4 | Daniel Radcliffe | Harry Potter |
| 5 | Fiona Shaw | Aunt Petunia Dursley |
| 6 | Harry Melling | Dudley Dursley |
| 7 | Richard Griffiths | Uncle Vernon Dursley |
| 8 | Derek Deadman | Bartender in Leaky Cauldron |
| 9 | Ian Hart | Professor Quirrell |


---
## 5.03 NLP I

### Define and implement tokenizing, lemmatizing, and stemming.

**Q11.** Define `tokenizing`.

> **Answer:** 
Tokenizing is the process of splitting text into smaller chunks/tokens.

**Q12.** Define `lemmatizing`.

>**Answer:**  
Lemmatizing is the process of reducing words to their 'lemma', the base form that is a real dictionary word.

**Q13.** Define `stemming`.

> **Answer:**  
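Stemming is the process of reducing a word to a crude root/base by stripping suffixes with simple rules. Unlike a lemma, the resulting stem is not guaranteed to be a dictionary word.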
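
The contrast between a stem and a lemma is easier to see on a few words. This is a toy sketch only: `toy_stem` strips a single suffix and `toy_lemmas` is a hand-written lookup table, both hypothetical stand-ins for real tools such as NLTK's `PorterStemmer` and `WordNetLemmatizer`.

```python
def toy_stem(word):
    """Strip one common suffix; a crude stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        # Only strip when at least three characters would remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


# Hypothetical lookup table standing in for a real lemmatizer's dictionary.
toy_lemmas = {"waiting": "wait", "continued": "continue", "studies": "study"}

for word in ("waiting", "continued", "studies"):
    print(word, "-> stem:", toy_stem(word), "| lemma:", toy_lemmas[word])
```

Stems such as `continu` and `studi` are not dictionary words, while the lemmas `continue` and `study` are.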

### Describe what RegEx does.

**Q14.** What is RegEx and what does it do?

> **Answer:**  
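A regular expression (RegEx) is a sequence of characters that defines a search pattern. It lets you find, extract, split, or replace the pieces of text that match that pattern.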
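
A few patterns with Python's built-in `re` module show what this looks like in practice. The sample sentence is adapted from the review used later in this notebook; the specific patterns are just illustrative choices.

```python
import re

text = "Waited half an hour for my food. Over 20 people stood waiting!"

# \w+ matches runs of letters, digits, and underscores, so this tokenizes.
tokens = re.findall(r"\w+", text.lower())

# \d+ matches runs of digits only.
numbers = re.findall(r"\d+", text)

# re.sub replaces every match; here, any character that is neither
# a word character nor whitespace (that is, punctuation).
no_punct = re.sub(r"[^\w\s]", "", text)

print(tokens)
print(numbers)
print(no_punct)
```

The `\w+` pattern is the same one passed to `RegexpTokenizer` in the sentiment analysis answer.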

### Apply sentiment analysis.

**Q15.** Apply a simple sentiment analysis using the following positive and negative words on the Google review.

In [23]:
positive_words = ['delight', 'good', 'great', 'awesome', 'tremendous', 'fabulous', 'amazing', 'stellar', 'best', 'love']
negative_words = ['garbage', 'sad', 'trash', 'ugly', 'bad', 'disgusting', 'terrible', 'gross', 'worst', 'awful']

review = 'Waited half an hour for my food. Normally the team is strong but i noticed the manager barely helping hiding in the back while over 20 people stood waiting for food while they continued to take orders. Doordash and the delivery services didnt help as there were massive orders backing the place up. Worst experience ever at a fast food place.'

In [24]:
# Answer
from nltk.tokenize import RegexpTokenizer
from nltk.stem.porter import PorterStemmer

# Code from global lesson 5.03
def simple_sentiment(text):
    # Instantiate tokenizer.
    tokenizer = RegexpTokenizer(r'\w+')
    
    # Tokenize text.
    tokens = tokenizer.tokenize(text.lower())
    
    # Instantiate stemmer.
    p_stemmer = PorterStemmer()
    
    # Stem words.
    stemmed_words = [p_stemmer.stem(i) for i in tokens]
    
    # Stem our positive/negative words.
    positive_stems = [p_stemmer.stem(i) for i in positive_words]
    negative_stems = [p_stemmer.stem(i) for i in negative_words]

    # Count "positive" words.
    positive_count = sum([1 for i in stemmed_words if i in positive_stems])
    
    # Count "negative" words
    negative_count = sum([1 for i in stemmed_words if i in negative_stems])
    
    # Calculate Sentiment Percentage 
    # (Positive Count - Negative Count) / (Total Count)

    return round((positive_count - negative_count) / len(tokens), 2)

In [25]:
simple_sentiment(review)

-0.02

### Preprocess text data.

**Q16.** What are some steps that you should take to pre-process text data?

>**Answer:**   
- Tokenize
- Lemmatize or stem
- Remove stop words
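
The first and last steps can be sketched with the standard library alone. The stop word set below is a deliberately tiny, hypothetical list; real lists (for example, the ones shipped with NLTK or scikit-learn) are much longer. A stemming or lemmatizing step would normally be applied to the surviving tokens.

```python
import re

# Hypothetical, deliberately tiny stop word list for illustration.
STOP_WORDS = {"a", "an", "the", "and", "for", "my", "is", "but", "i", "to", "in"}


def preprocess(text):
    """Lowercase, tokenize on word characters, and drop stop words."""
    tokens = re.findall(r"\w+", text.lower())
    return [tok for tok in tokens if tok not in STOP_WORDS]


print(preprocess("Waited half an hour for my food."))
```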

---
## 5.04 NLP II

### Extract features from unstructured text by fitting and transforming with CountVectorizer and TfidfVectorizer.

**Q17.** Vectorize the following data using CountVectorizer.

In [28]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

tweets = pd.read_csv('./data/trump_tweets.csv')
tweets = tweets['text']

In [34]:
# Answer
# code from global lesson 5.04
cvec = CountVectorizer()
cvec.fit(tweets)
cv_tweets = cvec.transform(tweets)
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out().
cv_tweets_df = pd.DataFrame(cv_tweets.toarray(),
                            columns=cvec.get_feature_names_out())
cv_tweets_df.head()

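| | 00 | 000 | 00pm | 05c14uxy2b | 09bac03rx6 | 0cyrjl1yoj | 0dtxsagkz1 | 0zhugiepot | 10 | 100 | ... | zcqc7nzmme | zdj3lyo76h | zelensky | zlldo4gtij | zone | zpqcjgc4xh | zqhadcikqg | zryfqxatin | ztcsrcqjcn | zucker |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |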


**Q18.** Vectorize the following data using TfidfVectorizer.

In [36]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

tweets = pd.read_csv('./data/trump_tweets.csv')
tweets = tweets['text']

In [40]:
# Answer
# code from global lesson 5.04
tfidf = TfidfVectorizer()
tfidf.fit(tweets)
tf_tweets = tfidf.transform(tweets)
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out().
tf_tweets_df = pd.DataFrame(tf_tweets.toarray(),
                            columns=tfidf.get_feature_names_out())
tf_tweets_df.head()

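| | 00 | 000 | 00pm | 05c14uxy2b | 09bac03rx6 | 0cyrjl1yoj | 0dtxsagkz1 | 0zhugiepot | 10 | 100 | ... | zcqc7nzmme | zdj3lyo76h | zelensky | zlldo4gtij | zone | zpqcjgc4xh | zqhadcikqg | zryfqxatin | ztcsrcqjcn | zucker |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |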


### Describe how CountVectorizers and TF-IDFVectorizers work.

**Q19.** How do `TF-IDF Vectorizers` work?

>**Answer:**  
TF-IDF (term frequency - inverse document frequency) Vectorizer "assigns each word in a document a number that is proportional to its frequency in the document and inversely proportional to the number of documents in which it occurs. Very common words, such as “a” or “the”, thereby receive heavily discounted tf-idf scores, in contrast to words that are very specific to the document in question. The result is a matrix of tf-idf scores with one row per document and as many columns as there are different words in the dataset." [source](https://buhrmann.github.io/tfidf-analysis.html)
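
The idea in the quote can be worked through by hand on a tiny corpus. This sketch uses the textbook formula, term frequency times `log(n_docs / doc_freq)`; the three documents are made up. Note that scikit-learn's `TfidfVectorizer` by default uses a smoothed variant, `ln((1 + n) / (1 + df)) + 1`, and L2-normalizes each row, so its numbers will differ from these.

```python
import math
from collections import Counter

# Made-up corpus: one string per document.
docs = [
    "the food was great",
    "the service was slow",
    "the food was cold and the service was slow",
]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: number of documents containing each word.
doc_freq = Counter(word for tokens in tokenized for word in set(tokens))


def tf_idf(tokens):
    """Textbook tf-idf: (count / doc length) * log(n_docs / doc_freq)."""
    counts = Counter(tokens)
    return {
        word: (count / len(tokens)) * math.log(n_docs / doc_freq[word])
        for word, count in counts.items()
    }


scores = [tf_idf(tokens) for tokens in tokenized]
print(scores[0])
```

In the first document, `the` and `was` appear in every document and score zero, while `great` appears only there and gets the highest score.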

**Q20.** How do `Count Vectorizers` work?

> **Answer:**  
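Count Vectorizers work by building a vocabulary from the corpus and then returning, for each document, an integer count of how often each vocabulary word appears in that document. The result is a matrix with one row per document and one column per word.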
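
A hand-rolled version makes the mechanics concrete. This is a simplified sketch on made-up documents: it splits on whitespace, whereas `CountVectorizer` also lowercases and, with its default token pattern, ignores single-character tokens.

```python
from collections import Counter

docs = ["the food was great", "the food was cold", "great great service"]
tokenized = [doc.split() for doc in docs]

# Vocabulary: every distinct word, sorted so the column order is stable.
vocab = sorted({word for tokens in tokenized for word in tokens})

# One row per document, one column per vocabulary word.
matrix = []
for tokens in tokenized:
    counts = Counter(tokens)
    matrix.append([counts[word] for word in vocab])

print(vocab)
for row in matrix:
    print(row)
```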

### Understand stop_words, max_features, min_df, max_df, and ngram_range.

**Q21.** What are `stop words`?

>**Answer:**  
Stop words are very common words that add little meaning to our analysis, so we often remove them. Examples include 'a', 'the', 'and', etc.

**Q22.** What does adjusting `max_features` in CountVectorizer do?

> **Answer:**  
This caps the vocabulary at the top `n` tokens, ranked by total frequency across the corpus. It keeps the feature matrix from growing to an unmanageable number of columns.

**Q23.** What does adjusting `min_df` in CountVectorizer do?

> **Answer:**  
This sets the minimum number of documents that a word must occur in to be included as a feature. An integer is read as a document count and a float as a proportion of documents.

**Q24.** What does adjusting `max_df` in CountVectorizer do?

> **Answer:**  
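This sets the maximum share of documents a word may occur in and still be used as a feature, which filters out words that are too common to be informative. A float is read as a proportion of documents and an integer as a document count.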
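
Both thresholds act on a word's document frequency, which the following sketch computes by hand. The corpus and the `filter_vocab` helper are hypothetical; here `min_df` is treated as a count and `max_df` as a proportion, which mirrors the values used in the grid search later in this notebook.

```python
from collections import Counter

docs = [
    "the food was great",
    "the food was cold",
    "the service was slow",
    "the staff was friendly",
]
# Use sets so each word counts at most once per document.
tokenized = [set(doc.split()) for doc in docs]
n_docs = len(docs)

# Number of documents each word appears in.
doc_freq = Counter(word for tokens in tokenized for word in tokens)


def filter_vocab(min_df=1, max_df=1.0):
    """Keep words with df >= min_df (a count) and df / n_docs <= max_df (a proportion)."""
    return sorted(
        word for word, df in doc_freq.items()
        if df >= min_df and df / n_docs <= max_df
    )


print(filter_vocab())                      # no filtering
print(filter_vocab(min_df=2, max_df=0.9))  # drop rare and ubiquitous words
```

With `min_df=2` and `max_df=0.9`, words that occur in only one document are dropped for being too rare, and `the` and `was` are dropped for appearing in every document.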

**Q25.** What does adjusting `ngram_range` in CountVectorizer do?

> **Answer:**  
This determines what $n$-grams should be considered as features. For example, if you set `ngram_range=(1, 2)` this captures both single words and pairs of words.
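
Generating the n-grams by hand shows what each setting adds. The `ngrams` function below is a hypothetical helper, not scikit-learn's implementation, and the tokens come from the review used in Q15.

```python
def ngrams(tokens, min_n, max_n):
    """Return all n-grams with min_n <= n <= max_n, joined by spaces."""
    grams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[start:start + n]))
    return grams


tokens = "worst experience ever".split()
print(ngrams(tokens, 1, 1))  # single words only
print(ngrams(tokens, 1, 2))  # single words plus adjacent pairs
```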

### Implement CountVectorizer and TfidfVectorizer in a spam classification model.

**Q26.** Use CountVectorizer along with a Logistic Regression model to predict whether the retweet count will be greater than 18,000.

In [73]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

tweets = pd.read_csv('./data/trump_tweets.csv')
tweets['target'] = np.where(tweets['retweet_count'] > 18000, 1, 0)

X = tweets['text']
y = tweets['target']

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=13)

In [74]:
# Answer
cvec = CountVectorizer()
X_train = cvec.fit_transform(X_train)
X_test = cvec.transform(X_test)

lr = LogisticRegression(solver='lbfgs')
lr.fit(X_train, y_train)

print(f'Training Score: {lr.score(X_train, y_train)}')
print(f'Testing Score: {lr.score(X_test, y_test)}')

Training Score: 0.9812734082397003
Testing Score: 0.6555555555555556


**Q27.** Use TfidfVectorizer along with a Logistic Regression model to predict whether the retweet count will be greater than 18,000.

In [75]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

tweets = pd.read_csv('./data/trump_tweets.csv')
tweets['target'] = np.where(tweets['retweet_count'] > 18000, 1, 0)

X = tweets['text']
y = tweets['target']

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=13)

In [76]:
# Answer
tvec = TfidfVectorizer()
X_train = tvec.fit_transform(X_train)
X_test = tvec.transform(X_test)

lr = LogisticRegression(solver='lbfgs')
lr.fit(X_train, y_train)

print(f'Training Score: {lr.score(X_train, y_train)}')
print(f'Testing Score: {lr.score(X_test, y_test)}')

Training Score: 0.9588014981273408
Testing Score: 0.6222222222222222


### Use GridSearchCV and Pipeline with CountVectorizer.

**Q28.** Use CountVectorizer along with a Logistic Regression model in a pipeline to predict whether the retweet count will be greater than 18,000. Search over the following values of hyperparameters:
- Maximum number of features fit: 3000, 4000
- Minimum number of documents needed to include token: 2, 3, 4
- Maximum proportion of documents a token may appear in and still be included: 85%, 90%, 95%
- N-gram range: individual tokens only, and individual tokens plus 2-grams.

In [78]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

tweets = pd.read_csv('./data/trump_tweets.csv')
tweets['target'] = np.where(tweets['retweet_count'] > 18000, 1, 0)

X = tweets['text']
y = tweets['target']

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=13)

In [79]:
# Answer
pipe = Pipeline([
    ('cvec', CountVectorizer()),
    ('lr', LogisticRegression(solver = 'lbfgs'))
])

pipe_params = {
    'cvec__max_features': [3_000, 4_000],
    'cvec__min_df': [2, 3, 4],
    'cvec__max_df': [.85, .9, .95],
    'cvec__ngram_range': [(1,1), (1,2)]
}

gs = GridSearchCV(pipe, param_grid=pipe_params, cv=5)

gs.fit(X_train, y_train)

print(gs.best_score_)

gs_model = gs.best_estimator_

print(f'Training Score: {gs_model.score(X_train, y_train)}')
print(f'Testing Score: {gs_model.score(X_test, y_test)}')

0.6591760299625468
Training Score: 0.9662921348314607
Testing Score: 0.6222222222222222
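
It helps to know how much work a grid search does before running it. The sketch below counts the combinations in the parameter grid from the answer, using only `itertools`; the grid and `cv=5` are copied from the code above.

```python
from itertools import product

# Same grid as in the answer to Q28.
pipe_params = {
    "cvec__max_features": [3_000, 4_000],
    "cvec__min_df": [2, 3, 4],
    "cvec__max_df": [0.85, 0.9, 0.95],
    "cvec__ngram_range": [(1, 1), (1, 2)],
}
cv_folds = 5

# Every combination of one value per hyperparameter.
combinations = list(product(*pipe_params.values()))
n_fits = len(combinations) * cv_folds

print(f"{len(combinations)} combinations x {cv_folds} folds = {n_fits} fits")
```

With the default `refit=True`, `GridSearchCV` then fits the best combination once more on the full training set, and `gs.best_params_` reports which values won.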




---
## 5.05 API Integration & Consumption

### Explain what an application program interface (API) is.

**Q29.** What is an API?

> **Answer:**  
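An application programming interface (API) is a defined set of rules through which one piece of software can request data or services from another, so a programmer can solve a task without knowing how the other system works internally.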

### Explain the very basics of HTTP and how it's used to get data from the web.

**Q30.** Broadly speaking, what is HTTP and how is it used?

> **Answer:**  
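HTTP stands for Hypertext Transfer Protocol. It is a set of rules for passing data around the web: a client sends a request (for example, `GET`) to a server, and the server sends back a response with a status code and, usually, content.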
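
The parts of a request can be inspected without sending anything. This sketch builds request objects with the standard library's `urllib.request.Request`; the `Accept` header and the POST body are illustrative assumptions, not requirements of any particular API.

```python
from urllib.request import Request

# Build (but do not send) a GET request.
get_req = Request(
    "https://pokeapi.co/api/v2/pokemon/eevee",
    headers={"Accept": "application/json"},
)

print(get_req.get_method())    # HTTP verb
print(get_req.host)            # server the request would go to
print(get_req.selector)        # path on that server
print(get_req.header_items())  # headers sent along with the request

# Attaching a body switches the default verb from GET to POST.
post_req = Request("https://example.com/submit", data=b"name=eevee")
print(post_req.get_method())
```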

### Use the requests library to submit HTTP requests from Python.

**Q31.** Submit an HTTP request to get information about a Pokemon of your choice.

In [8]:
base_url = "https://pokeapi.co/api/v2/pokemon/"

In [9]:
# Answer:
import requests

# Make request
res = requests.get(base_url + "eevee")
res.json()['moves'][0]['move']

{'name': 'sand-attack', 'url': 'https://pokeapi.co/api/v2/move/28/'}

### Use a free Python API to access real-time stock prices.

**Q32.** Using the free [open-notify API](http://api.open-notify.org/), get the number of people in space right now.

*Note: API found from this [source](https://www.dataquest.io/blog/python-api-tutorial/).*

In [10]:
url = 'http://api.open-notify.org/astros.json'

# Answer:
import requests

res = requests.get(url)
res.json()

{'people': [{'name': 'Andrew Morgan', 'craft': 'ISS'},
  {'name': 'Oleg Skripochka', 'craft': 'ISS'},
  {'name': 'Jessica Meir', 'craft': 'ISS'},
  {'name': 'Chris Cassidy', 'craft': 'ISS'},
  {'name': 'Anatoly Ivanishin', 'craft': 'ISS'},
  {'name': 'Ivan Vagner', 'craft': 'ISS'}],
 'message': 'success',
 'number': 6}
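
To answer the question directly, pull the count out of the parsed JSON. The sketch below parses a copy of the payload shown above so that it runs offline; with a live response you would use `data = res.json()` instead.

```python
import json

# Sample payload copied from the response shown above.
payload = """
{"people": [{"name": "Andrew Morgan", "craft": "ISS"},
            {"name": "Oleg Skripochka", "craft": "ISS"},
            {"name": "Jessica Meir", "craft": "ISS"},
            {"name": "Chris Cassidy", "craft": "ISS"},
            {"name": "Anatoly Ivanishin", "craft": "ISS"},
            {"name": "Ivan Vagner", "craft": "ISS"}],
 "message": "success",
 "number": 6}
"""

data = json.loads(payload)

# The count is stored under the 'number' key.
n_people = data["number"]
names = [person["name"] for person in data["people"]]

print(n_people)
print(names)
```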

---
## 5.06 Introduction to AWS

### Properly define cloud computing

**Q33.** What is cloud computing?

> **Answer:**  
Cloud computing is computing on a server that is maintained by someone else.

### Explain and identify the need for cloud computing

**Q34.** Why do we need cloud computing?

> **Answer:**  
Powerful computers are expensive and maintaining servers is cumbersome. Cloud computing services let us rent time on servers that would be far too expensive to buy and run on our own.

### Connect to a remote server from the command line

**Q35.** What is the protocol we used to connect to our server from the command line? Bonus: what is the command that we used to do this?

> **Answer:**  
SSH: Secure shell protocol  
Bonus: `ssh -i /path/to/key.pem ubuntu@ec2-ip-address.amazonaws.com`

### Instantiate and configure a remote server

**Q36.** True/False: it is okay to put your code on GitHub that has your security key on it.

> **Answer:**  
False! This is a terrible idea and will typically get flagged by AWS/GitHub pretty quickly telling you to take it down. Someone else could use this key and it could cost you a lot of money. You can read someone's personal account of this [here](https://www.dannyguo.com/blog/i-published-my-aws-secret-key-to-github/).
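
One common way to keep keys out of a repository is to read them from environment variables at runtime. The sketch below assumes the standard AWS variable names `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY`; the `load_aws_credentials` helper and the placeholder values are hypothetical.

```python
import os


def load_aws_credentials():
    """Read AWS credentials from environment variables instead of source code."""
    key_id = os.environ.get("AWS_ACCESS_KEY_ID")
    secret = os.environ.get("AWS_SECRET_ACCESS_KEY")
    if not key_id or not secret:
        raise RuntimeError("AWS credentials are not set in the environment.")
    return key_id, secret


# Placeholder values for demonstration only; never commit real keys.
# Normally these are set in your shell, not in the script.
os.environ["AWS_ACCESS_KEY_ID"] = "EXAMPLE_KEY_ID"
os.environ["AWS_SECRET_ACCESS_KEY"] = "EXAMPLE_SECRET"

key_id, secret = load_aws_credentials()
print(key_id)
```

It is also worth listing key files such as `*.pem` in `.gitignore` so they cannot be committed by accident.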

### Post and retrieve files to and from a remote server

**Q37.** What should you always do when you are done working on a remote server and why?

> **Answer:**  
Stop or terminate your instance. A running instance keeps accruing charges even when you are not using it, so leaving one on can accidentally cost you a lot of money.