In [1]:
# Initialize Otter Grader
import otter
grader = otter.Notebook()

# Data-X Spring 2018: Homework 02

### Regression, Classification, Webscraping

**Authors:** 

Alexander Fred-Ojala

Ishaan Malhi

In this homework, you will do some exercises with web-scraping.

## Fun with Webscraping & Text manipulation


## 1. Statistics in Presidential Debates

Your first task is to scrape Presidential Debates from the Commission of Presidential Debates website: http://www.debates.org/index.php?page=debate-transcripts.

To do this, you are not allowed to manually look up the URLs that you need, instead you have to scrape them. The root url to be scraped is the one listed above, namely: http://www.debates.org/index.php?page=debate-transcripts


1. By using `requests` and `BeautifulSoup` find all the links / URLs on the website that links to transcriptions of **First Presidential Debates** from the years [2012, 2008, 2004, 2000, 1996, 1988, 1984, 1976, 1960]. In total you should find 9 links / URLs tat fulfill this criteria. Print the urls.
2. When you have a list of the URLs your task is to create a Data Frame with some statistics (see example of output below):
    1. Scrape the title of each link and use that as the column name in your Data Frame. 
    2. Count how long the transcript of the debate is (as in the number of characters in transcription string). Feel free to include `\` characters in your count, but remove any breakline characters, i.e. `\n`. You will get credit if your count is +/- 10% from our result.
    3. Count how many times the word **war** was used in the different debates. Note that you have to convert the text in a smart way (to not count the word **warranty** for example, but counting **war.**, **war!**, **war,** or **War** etc.
    4. Also scrape the most common used word in the debate, and write how many times it was used. Note that you have to use the same strategy as in 3 in order to do this.
    
    Print your results. Your output for 1 column should look something similar to the result below:
    
    <img src='./example_output_2_1.png' width="500">

**Tips:**

___

In order to solve questions above it can be useful to work with Regular Expressions and explore methods on strings like `.strip(), .replace(), .find(), .count(), .lower()` etc. Both are very powerful tools to do string processing in Python. To count common words for example we can use a `Counter` object and a Regular expression pattern for only words.

Read more about Regular Expressions here: https://docs.python.org/3/howto/regex.html

In [2]:
## Import libraries
import requests
import bs4 as bs
import pandas as pd

Q1. a) Write a command to request the data from debates.org. Store the content in the variable called `original`

<!--
BEGIN QUESTION
name: q1a
manual: true
hidden: false
points: 1
-->
<!-- EXPORT TO PDF -->

In [3]:
root_url = 'http://www.debates.org/index.php?page=debate-transcripts'
original = requests.get(root_url).content

In [4]:
grader.check("q1a")

Q1 b) Use the `original` variable to create a `BeautifulSoup` object.

<!--
BEGIN QUESTION
name: q1b
manual: true
hidden: false
points: 1
-->
<!-- EXPORT TO PDF -->

In [5]:
soup = bs.BeautifulSoup(original, features='html.parser')

In [6]:
grader.check("q1b")

In [7]:
## Let's look at what the scraped website looks like
soup.prettify

<bound method Tag.prettify of <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">

<html>
<head>
<meta content="en-us" http-equiv="Content-Language"/>
<meta content="text/html;charset=utf-8" http-equiv="Content-Type"/>
<title>CPD: Debate Transcripts</title>
<link href="/wp-content/themes/debates2019/css/reset.css" rel="stylesheet" type="text/css"/>
<link href="/wp-content/themes/debates2019/css/jc-main.css" media="screen,projection" rel="stylesheet" type="text/css"/>
<link href="/wp-content/themes/debates2019/css/fonts.css" media="screen,projection" rel="stylesheet" type="text/css"/>
<!--[if gte IE 5]>
        <link href="/wp-content/themes/debates2019/css/jc-iemain.css" rel="stylesheet" type="text/css" media="screen,projection"  />
        <![endif]-->
<link href="/wp-content/themes/debates2019/css/styles.css" media="screen" rel="stylesheet" type="text/css"/>
<style>
.page-item-44 .children {
    display: none;
}
</style>
</head>
<body>
<div id="wrapper">
<div id="header">

Q1 c) Extract all the `a` links that contain the words `First`, *except* the debates from 1992. Store the results in the array called `urls`

<!--
BEGIN QUESTION
name: q1c
manual: true
hidden: false
points: 1
-->
<!-- EXPORT TO PDF -->

In [8]:
urls=[]
cols = []
for i in soup.find_all('a'):
    text = i.text
    if ('First' in text) & ~('1992' in text) & ~('First half' in text):
        urls.append(i.get('href'))
        cols.append(text)

urls # print the urls

['/voter-education/debate-transcripts/september-26-2016-debate-transcript/',
 '/voter-education/debate-transcripts/october-3-2012-debate-transcript/',
 '/voter-education/debate-transcripts/2008-debate-transcript/',
 '/voter-education/debate-transcripts/september-30-2004-debate-transcript/',
 '/voter-education/debate-transcripts/october-3-2000-transcript/',
 '/voter-education/debate-transcripts/october-6-1996-debate-transcript/',
 '/voter-education/debate-transcripts/september-25-1988-debate-transcript/',
 '/voter-education/debate-transcripts/october-7-1984-debate-transcript/',
 '/voter-education/debate-transcripts/september-23-1976-debate-transcript/',
 '/voter-education/debate-transcripts/september-26-1960-debate-transcript/']

In [9]:
grader.check("q1c")

Q1. d) Create an NaN valued dataframe and assign it to a variable `df0`. Set only the columns and the index for this dataframe. Your output should look similar to the result below. Note that there should be more columns in your result

<img src="./example_output_1_d.png">

<!--
BEGIN QUESTION
name: q1d
manual: true
hidden: false
points: 3
-->
<!-- EXPORT TO PDF -->

In [10]:
cols
index_values = ['Debate char length', 'war_count','most_common_w', 'most_common_w_count']
print(cols)
df0 = pd.DataFrame(index = index_values, columns = cols)
df0

['September 26, 2016: The First Clinton-Trump Presidential Debate', 'October 3, 2012: The First Obama-Romney Presidential Debate', 'September 26, 2008: The First McCain-Obama Presidential Debate', 'September 30, 2004: The First Bush-Kerry Presidential Debate', 'October 3, 2000: The First Gore-Bush Presidential Debate', 'October 6, 1996: The First Clinton-Dole Presidential Debate', 'September 25, 1988: The First Bush-Dukakis Presidential Debate', 'October 7, 1984: The First Reagan-Mondale Presidential Debate', 'September 23, 1976: The First Carter-Ford Presidential Debate', 'September 26, 1960: The First Kennedy-Nixon Presidential Debate']


Unnamed: 0,"September 26, 2016: The First Clinton-Trump Presidential Debate","October 3, 2012: The First Obama-Romney Presidential Debate","September 26, 2008: The First McCain-Obama Presidential Debate","September 30, 2004: The First Bush-Kerry Presidential Debate","October 3, 2000: The First Gore-Bush Presidential Debate","October 6, 1996: The First Clinton-Dole Presidential Debate","September 25, 1988: The First Bush-Dukakis Presidential Debate","October 7, 1984: The First Reagan-Mondale Presidential Debate","September 23, 1976: The First Carter-Ford Presidential Debate","September 26, 1960: The First Kennedy-Nixon Presidential Debate"
Debate char length,,,,,,,,,,
war_count,,,,,,,,,,
most_common_w,,,,,,,,,,
most_common_w_count,,,,,,,,,,


In [11]:
grader.check("q1d")

Q1. e) Now create the dataframe by going to every link that you captured in the variable `urls`.

<!--
BEGIN QUESTION
name: q1e
manual: true
hidden: false
points: 4
-->
<!-- EXPORT TO PDF -->

In [12]:
import re
from collections import Counter

def processPage(url):
    
    source = requests.get(url)
    soup = bs.BeautifulSoup(source.content, features='html.parser')


    #Find number of characters
    numChar = len(re.sub(r'\n', '',  soup.text))

    # Find number of war
    war = re.findall(r'war/W', soup.text)
    numWar = len(war)

    # Find most common word
    words = re.findall(r'[a-zA-z]+', soup.text.lower())

    counter = Counter(words) 

    # most_common() produces k frequently encountered 
    # input values and their respective counts. 
    most_occur = counter.most_common(1) 
    mostCommonWord = most_occur[0][0]
    mostCommonWordCount = most_occur[0][1]
    return [numChar, numWar, mostCommonWord, mostCommonWordCount]

#print(numChar, )

In [13]:
for idx, url in enumerate(urls):
    fullURL = "https://www.debates.org/" + urls[idx]
    df0[str(cols[idx])] = processPage(fullURL)

## Print df
df0

Unnamed: 0,"September 26, 2016: The First Clinton-Trump Presidential Debate","October 3, 2012: The First Obama-Romney Presidential Debate","September 26, 2008: The First McCain-Obama Presidential Debate","September 30, 2004: The First Bush-Kerry Presidential Debate","October 3, 2000: The First Gore-Bush Presidential Debate","October 6, 1996: The First Clinton-Dole Presidential Debate","September 25, 1988: The First Bush-Dukakis Presidential Debate","October 7, 1984: The First Reagan-Mondale Presidential Debate","September 23, 1976: The First Carter-Ford Presidential Debate","September 26, 1960: The First Kennedy-Nixon Presidential Debate"
Debate char length,96276,96389,184061,84725,93102,95153,89571,88585,82818,63020
war_count,0,0,0,0,0,0,0,0,0,0
most_common_w,the,the,the,the,the,the,the,the,the,the
most_common_w_count,645,758,1473,859,921,878,805,868,858,780


In [14]:
grader.check("q1e")

    
## 2. Download and read in specific line from many data sets

Scrape the first 27 data sets from this URL http://people.sc.fsu.edu/~jburkardt/datasets/regression/ (i.e.`x01.txt` - `x27.txt`). Then, save the 5th line in each data set, this should be the name of the data set author (get rid of the `#` symbol, the white spaces and the comma at the end). 

Count how many times (with a Python function) each author is the reference for one of the 27 data sets. Showcase your results, sorted, with the most common author name first and how many times he appeared in data sets. Use a Pandas DataFrame to show your results, see example.
Print your results.

**Example output of the answer to Question 2:**

![author_stats](https://github.com/ikhlaqsidhu/data-x/raw/master/x-archive/misc/hw2_imgs_spring2018/data_authors.png)

In [15]:
data_names = list()
# Create a list of filenames
for i in range(1,28):
    data_names.append('x'+str(i).zfill(2)+'.txt')

In [16]:
data_names

['x01.txt',
 'x02.txt',
 'x03.txt',
 'x04.txt',
 'x05.txt',
 'x06.txt',
 'x07.txt',
 'x08.txt',
 'x09.txt',
 'x10.txt',
 'x11.txt',
 'x12.txt',
 'x13.txt',
 'x14.txt',
 'x15.txt',
 'x16.txt',
 'x17.txt',
 'x18.txt',
 'x19.txt',
 'x20.txt',
 'x21.txt',
 'x22.txt',
 'x23.txt',
 'x24.txt',
 'x25.txt',
 'x26.txt',
 'x27.txt']

Q2. a) Now that we have the filenames, scrape and pick the *5th line* in each data set.

<!--
BEGIN QUESTION
name: q2a
manual: true
hidden: false
points: 1
-->
<!-- EXPORT TO PDF -->

In [17]:
import re
count = dict()
authors = []

for n in data_names:
    content = requests.get("https://people.sc.fsu.edu/~jburkardt/datasets/regression/"+ n).content
    soup = bs.BeautifulSoup(content,features='lxml')
    authorRow = re.findall(r'\n#.+,\n#', soup.text)[0]
    authorName = re.search(r'[A-Z]+.*,', authorRow)[0][:-1]
    if ' and ' in authorName:
        authorNames = re.split(r' and ', authorName)
        authors.extend(authorNames)
    elif ', ' in authorName:
        authorNames = re.split(r', ', authorName)
        authors.extend(authorNames)
    else:
        authors.append(authorName)

for a in authors:
    if a in count: 
        # Increment count of word by 1 
        count[a] = count[a] + 1
    else: 
        # Add the word to dictionary with count 1 
        count[a] = 1

In [18]:
grader.check("q2a")

Q2. b) Create a dataframe with this variable `count`.

<!--
BEGIN QUESTION
name: q2b
manual: true
hidden: false
points: 1
-->
<!-- EXPORT TO PDF -->

In [19]:
df= pd.DataFrame.from_dict(count, orient='index' )
df.index.name = "Authors"
df.rename(columns = {0: "Counts"}, inplace = True)
df


Unnamed: 0_level_0,Counts
Authors,Unnamed: 1_level_1
Helmut Spaeth,16
R J Freund,2
P D Minton,2
D G Kleinbaum,2
L L Kupper,2
K A Brownlee,1
S Chatterjee,4
B Price,4
S C Narula,2
J F Wellington,2


In [20]:
grader.check("q2b")

Q2. c) Set the index of this dataframe `df` to `Authors`
<!--
BEGIN QUESTION
name: q2c
manual: true
hidden: false
points: 1
-->
<!-- EXPORT TO PDF -->

In [21]:
#Done already
df

Unnamed: 0_level_0,Counts
Authors,Unnamed: 1_level_1
Helmut Spaeth,16
R J Freund,2
P D Minton,2
D G Kleinbaum,2
L L Kupper,2
K A Brownlee,1
S Chatterjee,4
B Price,4
S C Narula,2
J F Wellington,2


In [22]:
grader.check("q2c")

Q2. d) Sort the `Counts` column in descending order. Make sure your dataframe `df` is modified.
<!--
BEGIN QUESTION
name: q2d
manual: true
hidden: false
points: 1
-->
<!-- EXPORT TO PDF -->

In [23]:
df.sort_values("Counts", inplace = True, ascending = False)
display(df)
print(df.index)

Unnamed: 0_level_0,Counts
Authors,Unnamed: 1_level_1
Helmut Spaeth,16
S Chatterjee,4
B Price,4
R J Freund,2
P D Minton,2
D G Kleinbaum,2
L L Kupper,2
S C Narula,2
J F Wellington,2
K A Brownlee,1


Index(['Helmut Spaeth', 'S Chatterjee', 'B Price', 'R J Freund', 'P D Minton',
       'D G Kleinbaum', 'L L Kupper', 'S C Narula', 'J F Wellington',
       'K A Brownlee'],
      dtype='object', name='Authors')


In [24]:
grader.check("q2d")

Congratulations! Please select `Kernel -> Restart and Run All` and submit this `.ipynb` file to gradescope. 

*Important: Please submit this notebook, not the pdf file.*