# Data-X Spring 2019: Homework 7

### Webscraping



In this homework, you will do some exercises with web-scraping.

## Name: Casey Chadwell

## SID: 3033291861



### Fun with Webscraping & Text manipulation


## 1. Statistics in Presidential Debates

Your first task is to scrape Presidential Debates from the Commission on Presidential Debates website: https://www.debates.org/voter-education/debate-transcripts/

You are not allowed to look up the URLs you need manually; you have to scrape them. The root URL to scrape is the one listed above: https://www.debates.org/voter-education/debate-transcripts/


1. Using `requests` and `BeautifulSoup`, find all the links / URLs on the website that link to transcripts of **First Presidential Debates** from the years [1988, 1984, 1976, 1960]. In total you should find 4 links / URLs that fulfill these criteria. **Print the URLs.**

2. Once you have a list of the URLs, your task is to create a Data Frame with some statistics (see example of output below):
    1. Scrape the title of each link and use that as the column name in your Data Frame. 
    2. Count how long the transcript of the debate is (the number of characters in the transcript string). Feel free to include `\` characters in your count, but remove any line-break characters, i.e. `\n`. You will get credit if your count is within +/- 10% of our result.
    3. Count how many times the word **war** was used in the different debates. Note that you have to convert the text in a smart way, so that **warranty**, for example, is not counted, but **war.**, **war!**, **war,** and **War** are.
    4. Also find the most commonly used word in each debate, and report how many times it was used. Note that you have to use the same strategy as in the previous step (C) to do this.
    
    **Print your final output result.**
    
**Tips:**

___

To solve the questions above, it can be useful to work with regular expressions and to explore string methods such as `.strip()`, `.replace()`, `.find()`, `.count()` and `.lower()`. Both are very powerful tools for string processing in Python. To count common words, for example, I used a `Counter` object and a regular expression pattern that matches only words:

```python
    from collections import Counter
    import re

    counts = Counter(re.findall(r"[\w']+", text.lower()))
```

Read more about Regular Expressions here: https://docs.python.org/3/howto/regex.html
    
    
**Example output of all of the answers to Question 1.2:**


![pres_stats_2](https://github.com/ikhlaqsidhu/data-x/raw/master/x-archive/misc/hw2_imgs_spring2018/presidents_stats_2.jpg)



----





In [143]:
import requests 
import bs4 as bs 
import pandas as pd
import numpy as np
import re
from collections import Counter

## 1.1: Links to First Presidential Debates

In [168]:
source = requests.get("https://www.debates.org/voter-education/debate-transcripts/") 
soup = bs.BeautifulSoup(source.content, features='html.parser') 
all_a = soup.find_all('a')

links_to_years = []
titles = []

for a in all_a:
    # match on the link text; a.contents[0] can be a nested tag or missing
    text = a.get_text()
    if re.search(r'(1988|1984|1976|1960): The First', text):
        link = a.get('href')
        links_to_years.append(link)
        titles.append(text)
        
for link in links_to_years:
    print(re.findall(r'(1988|1984|1976|1960)', link)[0], ': https://www.debates.org' + link)

1988 : https://www.debates.org/voter-education/debate-transcripts/september-25-1988-debate-transcript/
1984 : https://www.debates.org/voter-education/debate-transcripts/october-7-1984-debate-transcript/
1976 : https://www.debates.org/voter-education/debate-transcripts/september-23-1976-debate-transcript/
1960 : https://www.debates.org/voter-education/debate-transcripts/september-26-1960-debate-transcript/
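The loop above builds each address by prepending the domain to the `href`, which works because these links are root-relative paths. As a minimal sketch, `urllib.parse.urljoin` from the standard library resolves an `href` against the page it was scraped from, so relative and absolute links are handled the same way. The `hrefs` list below simply copies two paths from the output above.

```python
from urllib.parse import urljoin

root_url = "https://www.debates.org/voter-education/debate-transcripts/"

# Two hrefs copied from the printed output above (root-relative paths).
hrefs = [
    "/voter-education/debate-transcripts/september-25-1988-debate-transcript/",
    "/voter-education/debate-transcripts/october-7-1984-debate-transcript/",
]

# urljoin resolves each href against the URL of the page it came from.
full_urls = [urljoin(root_url, href) for href in hrefs]

for url in full_urls:
    print(url)
```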


## 1.2: Data Frame and Statistics

##### A. titles as columns

In [169]:
# got titles above in for loop
titles

['September 25, 1988: The First Bush-Dukakis Presidential Debate',
 'October 7, 1984: The First Reagan-Mondale Presidential Debate',
 'September 23, 1976: The First Carter-Ford Presidential Debate',
 'September 26, 1960: The First Kennedy-Nixon Presidential Debate']

##### B. transcript length in characters (line breaks removed)



In [170]:
def get_char_count(url, div_id):
    source = requests.get(url)
    soup = bs.BeautifulSoup(source.content, features='html.parser')
    all_divs = soup.find(id = div_id).text
    return(len(str(all_divs).replace('\n', '')))

counts = []

for link in links_to_years:
    counts.append(get_char_count('https://www.debates.org' + link, 'content-sm'))

counts

[87488, 86505, 80735, 60937]

##### C. count of the word war

In [172]:
def get_war_count(url, div_id):
    source = requests.get(url)
    soup = bs.BeautifulSoup(source.content, features='html.parser')
    all_divs = soup.find(id = div_id).text
    return(len(re.findall(' war ', re.sub(r"[^\w\s]+?", "", str(all_divs).lower()))))

war_counts = []

for link in links_to_years:
    war_counts.append(get_war_count('https://www.debates.org' + link, 'content-sm'))

war_counts

[7, 2, 5, 3]
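Matching the literal string `' war '` requires a space on both sides, so an occurrence at the very start or end of the text, next to a line break, or directly after another match is not counted. As a hedged alternative, the sketch below uses regex word boundaries (`\b`) together with `re.IGNORECASE`. The helper `count_word` and the sample sentence are illustrative only and are not part of the assignment. Note that `\b` also treats apostrophes and hyphens as boundaries, so `war's` or `war-time` would count as matches.

```python
import re

def count_word(text, word="war"):
    """Count whole-word, case-insensitive occurrences of `word` in `text`."""
    pattern = r"\b" + re.escape(word) + r"\b"
    return len(re.findall(pattern, text, flags=re.IGNORECASE))

# Made-up sample: includes punctuation, a line break, and "warranty".
sample = "War! The war, they said, voided the warranty.\nNo more war\nwar war"

war_total = count_word(sample)
print(war_total)
```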

##### D. most common word and its count

In [173]:
def get_common_count(url, div_id):
    source = requests.get(url)
    soup = bs.BeautifulSoup(source.content, features='html.parser')
    all_divs = soup.find(id = div_id).text
    words = re.findall(r'\w+', re.sub(r"[^\w\s]+?", "", str(all_divs).lower()))
    # most_common(1) returns a one-element list: [(word, count)]
    return Counter(words).most_common(1)
    

common_w = []
common_c = []

for link in links_to_years:
    common = get_common_count('https://www.debates.org' + link, 'content-sm')
    common_w.append(common[0][0])
    common_c.append(common[0][1])

print(common_w)
print(common_c)

['the', 'the', 'the', 'the']
[799, 867, 856, 779]
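Each helper above downloads and parses the same page again. A minimal sketch of computing all three statistics from a transcript string that is already in memory could look like this. The name `transcript_stats` and the sample text are hypothetical, and the tokenisation mirrors the punctuation-stripping approach used above.

```python
import re
from collections import Counter

def transcript_stats(text):
    """Return character length, 'war' count and most common word for one text."""
    # Lower-case, drop punctuation, then split into words.
    words = re.findall(r"\w+", re.sub(r"[^\w\s]", "", text.lower()))
    counts = Counter(words)
    word, freq = counts.most_common(1)[0]
    return {
        "char_length": len(text.replace("\n", "")),
        "war_count": counts["war"],  # exact token match, so "warranty" is excluded
        "most_common_w": word,
        "most_common_w_count": freq,
    }

# Made-up sample text standing in for a scraped transcript.
sample = "The war ended.\nThe warranty did not cover the War!"
stats = transcript_stats(sample)
print(stats)
```

With this shape, a page only needs to be requested once per debate, and the returned dictionary maps directly onto the four rows of the final Data Frame.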


#### 1.2 End Solution: Data Frame

In [174]:
pd.DataFrame(data = [counts, war_counts, common_w, common_c], columns = titles, index = ['Debate char length', 'war_count', 'most_common_w', 'most_common_w_count'])

| | September 25, 1988: The First Bush-Dukakis Presidential Debate | October 7, 1984: The First Reagan-Mondale Presidential Debate | September 23, 1976: The First Carter-Ford Presidential Debate | September 26, 1960: The First Kennedy-Nixon Presidential Debate |
|---|---|---|---|---|
| Debate char length | 87488 | 86505 | 80735 | 60937 |
| war_count | 7 | 2 | 5 | 3 |
| most_common_w | the | the | the | the |
| most_common_w_count | 799 | 867 | 856 | 779 |


    
## 2. Download and read a specific line from many data sets

Scrape the first 27 data sets from this URL: http://people.sc.fsu.edu/~jburkardt/datasets/regression/ (i.e. `x01.txt` - `x27.txt`). Then save the 5th line of each data set; this should be the name of the data set's author (remove the `#` symbol, the whitespace and the comma at the end). 

Count how many times (with a Python function) each author is the reference for one of the 27 data sets. Show your results sorted, with the most common author name first, along with how many times that author appears in the data sets. Use a Pandas DataFrame to show your results, see example. **Print your final output result.**

**Example output of the answer for Question 2:**

![author_stats](https://github.com/ikhlaqsidhu/data-x/raw/master/x-archive/misc/hw2_imgs_spring2018/data_authors.png)
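As a small sketch of the cleaning and counting steps described above, the snippet below strips the `#`, the whitespace and the trailing comma from a few lines and counts the resulting names with `Counter`. The helper `clean_author` and the names in `fifth_lines` are made up for illustration and do not come from the data sets.

```python
from collections import Counter

def clean_author(line):
    """Strip the leading '#', surrounding whitespace and a trailing comma."""
    return line.lstrip("#").strip().rstrip(",")

# Hypothetical fifth lines, made up for illustration only.
fifth_lines = ["#    Jane Doe,", "#    John Roe,", "#    Jane Doe,"]

authors = [clean_author(line) for line in fifth_lines]

# most_common() returns (name, count) pairs sorted by count, highest first.
author_counts = Counter(authors).most_common()
print(author_counts)
```

The list of `(name, count)` pairs can be passed to `pd.DataFrame` with two column names to get the tabular form the question asks for.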


In [183]:
base_link = "http://people.sc.fsu.edu/~jburkardt/datasets/regression/"
source = requests.get(base_link) 
soup = bs.BeautifulSoup(source.content, features='html.parser') 
all_a = soup.find_all('a')

dataset_links = []
authors = []

for a in all_a:
    link = a.get('href')
    print(link)  # list every href found on the index page
    # keep only x01.txt through x27.txt
    if link and re.fullmatch(r'x(0[1-9]|1[0-9]|2[0-7])\.txt', link):
        dataset_links.append(link)
        # the files live on a web server, so fetch them with requests rather than open()
        lines = requests.get(base_link + link).text.splitlines()
        # 5th line holds the author: drop '#', surrounding whitespace and the trailing comma
        authors.append(lines[4].lstrip('#').strip().rstrip(','))

# count the data sets per author, most common first
author_df = pd.DataFrame(Counter(authors).most_common(), columns=['Author', 'Count'])

?C=N;O=D
?C=M;O=A
?C=S;O=A
?C=D;O=A
/~jburkardt/datasets/
regression.html
x01.txt
x02.txt
x03.txt
x04.txt
x05.txt
x06.txt
x07.txt
x08.txt
x09.txt
x10.txt
x11.txt
x12.txt
x13.txt
x14.txt
x15.txt
x16.txt
x17.txt
x18.txt
x19.txt
x20.txt
x21.txt
x22.txt
x23.txt
x24.txt
x25.txt
x26.txt
x27.txt
x28.txt
x29.txt
x30.txt
x31.txt
x32.txt
x33.txt
x34.txt
x35.txt
x36.txt
x37.txt
x38.txt
x39.txt
x40.txt
x41.txt
x42.txt
x43.txt
x43_01.txt
x43_02.txt
x43_03.txt
x44.txt
x44_01.txt
x44_02.txt
x44_03.txt
x45.txt
x45_01.txt
x45_02.txt
x45_03.txt
x46.txt
x47.txt
x47_01.txt
x47_02.txt
x47_03.txt
x48.txt
x48_01.txt
x48_02.txt
x48_03.txt
x49.txt
x49_01.txt
x49_02.txt
x49_03.txt
x50.txt
x50_01.txt
x50_02.txt
x50_03.txt
x51.txt
x51_01.txt
x51_02.txt
x51_03.txt
x52.txt
x52_01.txt
x52_02.txt
x52_03.txt
x53.txt
x53_01.txt
x53_02.txt
x53_03.txt
x54.txt
x54_01.txt
x54_02.txt
x54_03.txt
x55.txt
x55_01.txt
x55_02.txt
x55_03.txt
x56.txt
x56_01.txt
x56_02.txt
x56_03.txt
x57.txt
x57_01.txt
x57_02.txt
x57_03.txt
x58.txt
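The listing above contains far more entries than the 27 files the question asks for, including sorting links and suffixed files such as `x43_01.txt`. A minimal sketch of a pattern that keeps only `x01.txt` through `x27.txt`, checked against a few names taken from the listing:

```python
import re

# Matches x01.txt ... x27.txt and nothing else.
pattern = re.compile(r"x(0[1-9]|1[0-9]|2[0-7])\.txt")

# A few names taken from the listing above.
names = ["?C=N;O=D", "regression.html", "x01.txt", "x19.txt",
         "x27.txt", "x28.txt", "x43_01.txt"]

# fullmatch requires the whole name to fit the pattern, not just a prefix.
wanted = [name for name in names if pattern.fullmatch(name)]
print(wanted)
```

Using `fullmatch` rather than `search` avoids accepting names that merely contain a matching substring.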
