## Data Gathering

### Importing a Flat File with Pandas

In [5]:
import pandas as pd 

# Read the file into a DataFrame. It is a TSV (tab-separated values) file,
# but read_csv can still parse it if we pass the tab character as the separator
filepath = r"C:\Users\Dan\Documents\Work\Python\Learning Material\4. Udacity - NanoDegree\1. Data Wrangling\DataSets\bestofrt.tsv"
df = pd.read_csv(filepath, sep='\t')

# view first 5 lines 
df.head() 

Unnamed: 0,ranking,critic_score,title,number_of_critic_ratings
0,1,99,The Wizard of Oz (1939),110
1,2,100,Citizen Kane (1941),75
2,3,100,The Third Man (1949),77
3,4,99,Get Out (2017),282
4,5,97,Mad Max: Fury Road (2015),370
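
As a small self-contained check of the same idea, the sketch below parses tab-separated text from an in-memory string instead of a file on disk. The two sample rows are copied from the output above; the names tsv_text and sample_df are made up for illustration.

```python
import io
import pandas as pd

# Two sample rows copied from the output above, with tabs between the fields
tsv_text = (
    "ranking\tcritic_score\ttitle\tnumber_of_critic_ratings\n"
    "1\t99\tThe Wizard of Oz (1939)\t110\n"
    "2\t100\tCitizen Kane (1941)\t75\n"
)

# sep='\t' tells read_csv to split each line on tabs instead of commas
sample_df = pd.read_csv(io.StringIO(tsv_text), sep='\t')

print(sample_df.shape)
print(sample_df.dtypes)
```

The numeric columns should come back as integers without any extra conversion, which is worth confirming with dtypes before doing arithmetic on them.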


### Web Scraping
You do not need to run the original scraping cells in this section; Udacity provides the data as a zip file.

In [None]:
# Accessing web page HTML data 
import requests 
url = r"https://www.rottentomatoes.com/m/et_the_extraterrestrial" 
response = requests.get(url) 

In [None]:
# Method 1 - Save HTML to file 
with open("et_the_extraterrestrial.html", mode='wb') as file:
    file.write(response.content) 

In [None]:
# Method 2 - Don't save a file; work with the contents directly in memory instead
from bs4 import BeautifulSoup   # Library for parsing HTML data
soup = BeautifulSoup(response.content, 'lxml')   # name the parser explicitly, as in the cells further down

- NB: do not actually run the cells above. Udacity has given us the source data in a downloadable zip file

#### Let's run an example

In [1]:
from bs4 import BeautifulSoup 
import os 
import pandas as pd

#### Step by Step with break checks
- An example of building code step by step, using break to stop after the first file so we can check the output before writing the final code

In [3]:
df_list = []
folder = r"C:\Users\Dan\Documents\Work\Python\Learning Material\4. Udacity - NanoDegree\1. Data Wrangling\DataSets\rt-html\rt_html"

for movie_html in os.listdir(folder):
    with open(os.path.join(folder, movie_html)) as file:
        soup = BeautifulSoup(file, 'lxml')
        title = soup.find('title')
        print(title)
        break 
        
# running the above for loop gives:
# <title>12 Angry Men (Twelve Angry Men) (1957) - Rotten Tomatoes</title>

<title>12 Angry Men (Twelve Angry Men) (1957) - Rotten Tomatoes</title>


In [5]:
# so, we access the contents of this data using .contents
for movie_html in os.listdir(folder):
    with open(os.path.join(folder, movie_html)) as file:
        soup = BeautifulSoup(file, 'lxml')
        title = soup.find('title').contents[0]
        print(title)
        break

# which creates the output: 
# '12 Angry Men (Twelve Angry Men) (1957) - Rotten Tomatoes'

12 Angry Men (Twelve Angry Men) (1957) - Rotten Tomatoes


In [6]:
# so now, we need to remove the ' - Rotten Tomatoes' suffix from the end of the string
# slicing with a negative end index drops that many characters from the end, so
# [0:-len(' - Rotten Tomatoes')] keeps everything except the suffix
for movie_html in os.listdir(folder):
    with open(os.path.join(folder, movie_html)) as file:
        soup = BeautifulSoup(file, 'lxml')
        title = soup.find('title').contents[0][0:-len(' - Rotten Tomatoes')] 
        print(title)
        break
        
# and we get our answer as follows:
# 12 Angry Men (Twelve Angry Men) (1957)

12 Angry Men (Twelve Angry Men) (1957)
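
The slice above assumes every title ends with the suffix. A minimal sketch of a more defensive variant follows; strip_suffix is a hypothetical helper, not part of the original notebook, and it only removes the suffix when it is actually present.

```python
SUFFIX = ' - Rotten Tomatoes'

def strip_suffix(text, suffix=SUFFIX):
    """Return text without suffix; leave text unchanged if the suffix is absent."""
    # The 'suffix and' guard matters: text[:-0] would return an empty string
    if suffix and text.endswith(suffix):
        return text[:-len(suffix)]
    return text

page_title = '12 Angry Men (Twelve Angry Men) (1957) - Rotten Tomatoes'
print(strip_suffix(page_title))
print(strip_suffix('A title without the suffix'))
```

On Python 3.9 or later, the built-in str.removesuffix method does the same job.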


#### Step by Step to finding the number of audience ratings

In [7]:
# Finding the number of audience ratings
for movie_html in os.listdir(folder):
    with open(os.path.join(folder, movie_html)) as file:
        soup = BeautifulSoup(file, 'lxml')
        num_audience_ratings = soup.find('div', class_='audience-info hidden-xs superPageFontColor')
        print(num_audience_ratings)
        break

<div class="audience-info hidden-xs superPageFontColor">
<div>
<span class="subtle superPageFontColor">Average Rating:</span>
            4.2/5
                </div>
<div>
<span class="subtle superPageFontColor">User Ratings:</span>
        103,672</div>
</div>


We can see from the output above that within the outer <div> with class_='audience-info hidden-xs superPageFontColor' there are TWO inner <div> elements, each containing a <span> label. So we use the .find_all('div') method to get both, then use index 1 to access the second of the two

In [8]:
# Finding the number of audience ratings
for movie_html in os.listdir(folder):
    with open(os.path.join(folder, movie_html)) as file:
        soup = BeautifulSoup(file, 'lxml')
        num_audience_ratings = soup.find('div', class_='audience-info hidden-xs superPageFontColor')
        num_audience_ratings = num_audience_ratings.find_all('div')[1].contents
        print(num_audience_ratings)
        break

['\n', <span class="subtle superPageFontColor">User Ratings:</span>, '\n        103,672']


From the printout above, you can see we accessed the second inner <div> from the original output, which contains the number of user ratings

In [10]:
# so now, we want to take the number, 103,672, which is the third item (index=2) in the list shown above
# so use the [2] after the .contents to grab the 3rd item as index=2
for movie_html in os.listdir(folder):
    with open(os.path.join(folder, movie_html)) as file:
        soup = BeautifulSoup(file, 'lxml')
        num_audience_ratings = soup.find('div', class_='audience-info hidden-xs superPageFontColor')
        num_audience_ratings = num_audience_ratings.find_all('div')[1].contents[2]
        print(num_audience_ratings)
        break


        103,672


In [12]:
# you'll notice leading white space in front of the number, and that the number is still a string at the moment
# we can use the strip method to remove the white space
# we can then use the replace method to swap out any commas with an empty string
for movie_html in os.listdir(folder):
    with open(os.path.join(folder, movie_html)) as file:
        soup = BeautifulSoup(file, 'lxml')
        num_audience_ratings = soup.find('div', class_='audience-info hidden-xs superPageFontColor')
        num_audience_ratings = num_audience_ratings.find_all('div')[1].contents[2].strip().replace(',','')
        print(num_audience_ratings)
        break

103672
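
The whole chain can be tried without the downloaded files by parsing the HTML fragment printed earlier from a string. This is only a sketch: it uses Python's built-in 'html.parser' rather than 'lxml' so that it runs without the lxml package, and the variable names are illustrative.

```python
from bs4 import BeautifulSoup

# HTML fragment copied from the printed output earlier in this section
html = """
<div class="audience-info hidden-xs superPageFontColor">
<div>
<span class="subtle superPageFontColor">Average Rating:</span>
            4.2/5
                </div>
<div>
<span class="subtle superPageFontColor">User Ratings:</span>
        103,672</div>
</div>
"""

soup = BeautifulSoup(html, 'html.parser')
outer = soup.find('div', class_='audience-info hidden-xs superPageFontColor')
inner_divs = outer.find_all('div')          # the two inner <div> elements

# In each inner <div>, the text after the <span> label is the third child (index 2)
average_rating = inner_divs[0].contents[2].strip()
num_ratings = int(inner_divs[1].contents[2].strip().replace(',', ''))

print(average_rating)
print(num_ratings)
```

Indexing into .contents depends on the exact whitespace layout of the page, so a check like this against a saved fragment is a cheap way to notice if the markup changes.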


#### Full solution with notes

In [13]:
# The goal is to create a list of dictionaries to build file by file and later convert to DataFrame

# step 1 - create an empty list 
df_list = []

# step 2 - set the folder path where the HTML files are all saved 
folder = r"C:\Users\Dan\Documents\Work\Python\Learning Material\4. Udacity - NanoDegree\1. Data Wrangling\DataSets\rt-html\rt_html"

# step 3 - build a loop that passes over each HTML file, within the folder specified 
for movie_html in os.listdir(folder):
    
    # step 4 - take the folder path, and the current file name from this iteration of the loop 
    # and set together into variable 'file'
    with open(os.path.join(folder, movie_html)) as file:
        
        # step 5 - use the BeautifulSoup function to read the HTML file we are iterating over in the folder  
        # use function: BeautifulSoup(<file_name>, <parser_type>) 
        soup = BeautifulSoup(file, 'lxml') 

        # step 6 - use soup.find function to search for the movie 'title' data first 
        # then use .contents to grab the contents of that section 
        # then grabbing first item [index=0] & slicing off that last " - Rotten Tomatoes" text
        title = soup.find('title').contents[0][:-len(' - Rotten Tomatoes')]
        
        # step 7 - now we want to find the audience score. If we look at the HTML file, the % value sits inside the
        # <span> element within the 'audience-score meter' <div>. We can thus use the .find method to gather this data
        audience_score = soup.find('div', class_='audience-score meter').find('span').contents[0][:-1]
        # so what is the above line doing?
        # we start by finding a <div> element with class 'audience-score meter'
        # NB. because class is a Python keyword, BeautifulSoup takes the keyword argument class_ instead
        # we then use .find again, inside the element we just found, to search for the <span> element, as we saw this is
        # where the audience score data sat, & there was only one <span> within this <div>
        # then use .contents to access that data, taking index 0 because it's the only item there
        # this would give 90% for example, as a string. We don't want the % sign, so the final step grabs everything
        # but the last char in the string
        
        # step 8 - identify the number of audience ratings
        # see notes in step by step guide above for explanation on how to build out
        num_audience_ratings = soup.find('div', class_='audience-info hidden-xs superPageFontColor')
        num_audience_ratings = num_audience_ratings.find_all('div')[1].contents[2].strip().replace(',','')
        
        
        # step 9 - Append to list of dictionaries 
        # for each of the variables we have just built above, append them into the empty list we created at the start
        # and turn to integers where required
        # This is the final step of the loop, before it starts back over on the next file 
        df_list.append({'title': title,
                        'audience_score': int(audience_score),
                        'number_of_audience_ratings': int(num_audience_ratings)
                       })

# Step 10 
# Now that we have successfully looped through all HTML files in our folder, we want to convert our list of dictionaries to a DataFrame
# This can be done using pandas like so:
df = pd.DataFrame(df_list, columns = ['title', 'audience_score', 'number_of_audience_ratings']) 
        

The above code may take a little time to run (c. 10 seconds).
Once complete, let's take a look at the DataFrame we've just built, below

In [14]:
df.head()

Unnamed: 0,title,audience_score,number_of_audience_ratings
0,12 Angry Men (Twelve Angry Men) (1957),97,103672
1,The 39 Steps (1935),86,23647
2,The Adventures of Robin Hood (1938),89,33584
3,All About Eve (1950),94,44564
4,All Quiet on the Western Front (1930),89,17768


And as you can see above, a dataframe with the required info has now been built
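
The list-of-dictionaries pattern can be seen in isolation in the sketch below. The values are copied from the output above and stand in for what the loop extracts from each HTML file; scraped, rows and sample_df are illustrative names.

```python
import pandas as pd

# Values copied from the output above; in the real loop they come from each HTML file as strings
scraped = [
    ('12 Angry Men (Twelve Angry Men) (1957)', '97', '103672'),
    ('The 39 Steps (1935)', '86', '23647'),
]

rows = []
for title, audience_score, num_audience_ratings in scraped:
    # One dictionary per file, converting numeric strings to integers
    rows.append({'title': title,
                 'audience_score': int(audience_score),
                 'number_of_audience_ratings': int(num_audience_ratings)})

# Build the DataFrame once at the end; columns= fixes the column order
sample_df = pd.DataFrame(rows, columns=['title', 'audience_score', 'number_of_audience_ratings'])
print(sample_df)
```

Collecting plain dictionaries in a list and building the DataFrame once at the end avoids repeatedly growing a DataFrame inside the loop.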

### Downloading Files from the Internet

#### HTTP (Hypertext Transfer Protocol)
HTTP, the Hypertext Transfer Protocol, is the language that web browsers (like Chrome or Safari) and web servers (the computers where the contents of a website are stored) use to speak to each other. Every time you open a web page, download a file, or watch a video, HTTP makes it possible.

HTTP is a request/response protocol:

- Your computer, a.k.a. the client, sends a request to a server for some file. For this lesson: "Get me the file 1-the-wizard-of-oz-1939-film.txt", for example. GET is the name of the HTTP request method (of which there are multiple) used for retrieving data

- The web server sends back a response. If the request is valid: "Here is the file you asked for:", then followed by the contents of the 1-the-wizard-of-oz-1939-film.txt file itself
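
The request/response exchange described above can be sketched on a single machine using only the standard library. Everything here is an assumption made for illustration: ReviewHandler is a made-up server that answers every GET with one line of text, and the client talks to it over the loopback address rather than to the Udacity server.

```python
import http.client
import http.server
import threading

class ReviewHandler(http.server.BaseHTTPRequestHandler):
    """Toy server: answers every GET request with the same short text body."""

    def do_GET(self):
        payload = b"The Wizard of Oz (1939)\n"
        self.send_response(200)                      # status line: 200 OK
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)                    # the response body

    def log_message(self, format, *args):
        pass                                         # keep the example output quiet

# Port 0 asks the operating system for any free port
server = http.server.HTTPServer(("127.0.0.1", 0), ReviewHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
port = server.server_address[1]

# The client sends a GET request for a file and reads the response
conn = http.client.HTTPConnection("127.0.0.1", port, timeout=5)
conn.request("GET", "/1-the-wizard-of-oz-1939-film.txt")
resp = conn.getresponse()
status = resp.status
body = resp.read()
conn.close()

server.shutdown()
server.server_close()

print(status)
print(body)
```

The requests library used in the cells below wraps the same exchange: requests.get sends the GET request, and the returned object carries the status code and the body bytes.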

In [1]:
# we can use the requests library in Python to collect files from the internet 
import requests
import os 

In [6]:
# check for an existing directory to load files into; if it doesn't exist, create it
folder_name = 'ebert_reviews'
if not os.path.exists(folder_name):
    os.makedirs(folder_name)

# now check the folder this program points at
# "C:\Users\Dan\Documents\Work\Python\Learning Material\4. Udacity - NanoDegree\1. Data Wrangling\" 
# Inside of this file path, you will now see a sub-folder called "ebert_reviews"

In [9]:
# As an example, let's look at the E.T. film text file, which contains info about the film
url = 'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9904_11-e.t.-the-extra-terrestrial/11-e.t.-the-extra-terrestrial.txt'
response = requests.get(url) 
print(response)  # we should see <Response [200]> printed below for this example

<Response [200]>


In [10]:
# if we print the contents of response, we should see a body of text that exists within the file 
print(response.content)

b'E.T. The Extra-Terrestrial (1982)\nhttp://www.rogerebert.com/reviews/great-movie-et-the-extra-terrestrial-1982\nDear Raven and Emil:\n\nSunday we sat on the big green couch and watched "E.T. The Extra-Terrestrial" together with your mommy and daddy. It was the first time either of you had seen it, although you knew a little of what to expect because we took the "E.T." ride together at the Universal tour. I had seen the movie lots of times since it came out in 1982, so I kept one eye on the screen and the other on the two of you. I wanted to see how a boy on his fourth birthday, and a girl who had just turned 7 a week ago, would respond to the movie.\n\nWell, it "worked" for both of you, as we say in Grandpa Roger\'s business.\n\nRaven, you never took your eyes off the screen--not even when it looked like E.T. was dying and you had to scoot over next to me because you were afraid.\n\nEmil, you had to go sit on your dad\'s knee a couple of times, but you never stopped watching, either.

Next, we want to save the E.T. text file under its own name >> "11-e.t.-the-extra-terrestrial.txt"
This is everything after the last forward slash in the URL posted above
To get this programmatically, we can use Python's split method

In [11]:
# Use the Python split method to get the text file name from the end of the URL string, and join it to the folder path we created earlier
# we need to open the file in 'wb' mode (write binary) because response.content is bytes, not text
with open(os.path.join(folder_name,
                       url.split('/')[-1]), mode='wb') as file:
    file.write(response.content)
    
# And there you have it, we have now created a text file, in our 'ebert_reviews' folder path, that has gathered the information
# from the URL on the Udacity webserver, and written it to a file in our drive location
# You can now open this file in the ebert_reviews folder and it will show as a regular txt document
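
The file-name step and the binary write can be checked offline. In the sketch below, payload is a stand-in for response.content (no request is made), the file is written to a temporary folder, and the urlsplit variant is an optional alternative that also copes with a query string after the file name.

```python
import os
import posixpath
import tempfile
from urllib.parse import urlsplit

url = 'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9904_11-e.t.-the-extra-terrestrial/11-e.t.-the-extra-terrestrial.txt'

# Everything after the last forward slash is the file name
file_name = url.split('/')[-1]

# Alternative: take the last component of the URL path only, ignoring any ?query part
file_name_alt = posixpath.basename(urlsplit(url + '?download=1').path)

# Stand-in for response.content: bytes, which is why the file is opened in 'wb' mode
payload = 'E.T. The Extra-Terrestrial (1982)\n'.encode('utf-8')

with tempfile.TemporaryDirectory() as tmp_dir:
    target = os.path.join(tmp_dir, file_name)
    with open(target, mode='wb') as file:
        file.write(payload)
    # Reading it back as text needs an encoding to turn the bytes into a str
    with open(target, encoding='utf-8') as file:
        first_line = file.readline()

print(file_name)
print(first_line)
```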

#### Now let's look at downloading multiple files in one go
We can build a loop that runs over all the URLs (assuming we have them in a list) and repeats the above actions for each

In [12]:
# Using a URLs list for text files saved on Udacity servers
ebert_review_urls = ['https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9900_1-the-wizard-of-oz-1939-film/1-the-wizard-of-oz-1939-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9901_2-citizen-kane/2-citizen-kane.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9901_3-the-third-man/3-the-third-man.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9902_4-get-out-film/4-get-out-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9902_5-mad-max-fury-road/5-mad-max-fury-road.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9902_6-the-cabinet-of-dr.-caligari/6-the-cabinet-of-dr.-caligari.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9903_7-all-about-eve/7-all-about-eve.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9903_8-inside-out-2015-film/8-inside-out-2015-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9903_9-the-godfather/9-the-godfather.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9904_10-metropolis-1927-film/10-metropolis-1927-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9904_11-e.t.-the-extra-terrestrial/11-e.t.-the-extra-terrestrial.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9904_12-modern-times-film/12-modern-times-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9904_14-singin-in-the-rain/14-singin-in-the-rain.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9905_15-boyhood-film/15-boyhood-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9905_16-casablanca-film/16-casablanca-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9905_17-moonlight-2016-film/17-moonlight-2016-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9906_18-psycho-1960-film/18-psycho-1960-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9906_19-laura-1944-film/19-laura-1944-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9906_20-nosferatu/20-nosferatu.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9907_21-snow-white-and-the-seven-dwarfs-1937-film/21-snow-white-and-the-seven-dwarfs-1937-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9907_22-a-hard-day27s-night-film/22-a-hard-day27s-night-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9907_23-la-grande-illusion/23-la-grande-illusion.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9908_25-the-battle-of-algiers/25-the-battle-of-algiers.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9908_26-dunkirk-2017-film/26-dunkirk-2017-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9908_27-the-maltese-falcon-1941-film/27-the-maltese-falcon-1941-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9909_29-12-years-a-slave-film/29-12-years-a-slave-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9909_30-gravity-2013-film/30-gravity-2013-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9909_31-sunset-boulevard-film/31-sunset-boulevard-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad990a_32-king-kong-1933-film/32-king-kong-1933-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad990a_33-spotlight-film/33-spotlight-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad990a_34-the-adventures-of-robin-hood/34-the-adventures-of-robin-hood.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad990b_35-rashomon/35-rashomon.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad990b_36-rear-window/36-rear-window.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad990b_37-selma-film/37-selma-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad990c_38-taxi-driver/38-taxi-driver.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad990c_39-toy-story-3/39-toy-story-3.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad990c_40-argo-2012-film/40-argo-2012-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad990d_41-toy-story-2/41-toy-story-2.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad990d_42-the-big-sick/42-the-big-sick.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad990d_43-bride-of-frankenstein/43-bride-of-frankenstein.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad990d_44-zootopia/44-zootopia.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad990e_45-m-1931-film/45-m-1931-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad990e_46-wonder-woman-2017-film/46-wonder-woman-2017-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad990e_48-alien-film/48-alien-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad990f_49-bicycle-thieves/49-bicycle-thieves.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad990f_50-seven-samurai/50-seven-samurai.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad990f_51-the-treasure-of-the-sierra-madre-film/51-the-treasure-of-the-sierra-madre-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9910_52-up-2009-film/52-up-2009-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9910_53-12-angry-men-1957-film/53-12-angry-men-1957-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9910_54-the-400-blows/54-the-400-blows.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9911_55-logan-film/55-logan-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9911_57-army-of-shadows/57-army-of-shadows.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9912_58-arrival-film/58-arrival-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9912_59-baby-driver/59-baby-driver.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9913_60-a-streetcar-named-desire-1951-film/60-a-streetcar-named-desire-1951-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9913_61-the-night-of-the-hunter-film/61-the-night-of-the-hunter-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9913_62-star-wars-the-force-awakens/62-star-wars-the-force-awakens.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9913_63-manchester-by-the-sea-film/63-manchester-by-the-sea-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9914_64-dr.-strangelove/64-dr.-strangelove.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9914_66-vertigo-film/66-vertigo-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9914_67-the-dark-knight-film/67-the-dark-knight-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9915_68-touch-of-evil/68-touch-of-evil.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9915_69-the-babadook/69-the-babadook.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9915_72-rosemary27s-baby-film/72-rosemary27s-baby-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9916_73-finding-nemo/73-finding-nemo.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9916_74-brooklyn-film/74-brooklyn-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9917_75-the-wrestler-2008-film/75-the-wrestler-2008-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9917_77-l.a.-confidential-film/77-l.a.-confidential-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9918_78-gone-with-the-wind-film/78-gone-with-the-wind-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9918_79-the-good-the-bad-and-the-ugly/79-the-good-the-bad-and-the-ugly.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9918_80-skyfall/80-skyfall.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9919_82-tokyo-story/82-tokyo-story.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9919_83-hell-or-high-water-film/83-hell-or-high-water-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9919_84-pinocchio-1940-film/84-pinocchio-1940-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9919_85-the-jungle-book-2016-film/85-the-jungle-book-2016-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad991a_86-la-la-land-film/86-la-la-land-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad991b_87-star-trek-film/87-star-trek-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad991b_89-apocalypse-now/89-apocalypse-now.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad991c_90-on-the-waterfront/90-on-the-waterfront.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad991c_91-the-wages-of-fear/91-the-wages-of-fear.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad991c_92-the-last-picture-show/92-the-last-picture-show.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad991d_93-harry-potter-and-the-deathly-hallows-part-2/93-harry-potter-and-the-deathly-hallows-part-2.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad991d_94-the-grapes-of-wrath-film/94-the-grapes-of-wrath-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad991d_96-man-on-wire/96-man-on-wire.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad991e_97-jaws-film/97-jaws-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad991e_98-toy-story/98-toy-story.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad991e_99-the-godfather-part-ii/99-the-godfather-part-ii.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad991e_100-battleship-potemkin/100-battleship-potemkin.txt']

In [14]:
# check for an existing directory to load files into; if it doesn't exist, create it
folder_name = 'ebert_reviews'
if not os.path.exists(folder_name):
    os.makedirs(folder_name)

# So, using a For Loop, we iterate over each of the URLs listed above, reading the contents, and writing to a file within 
# our 'ebert_reviews' folder
for review_url in ebert_review_urls:
    response = requests.get(review_url)
    with open(os.path.join(folder_name, review_url.split('/')[-1]), mode='wb') as file:
        file.write(response.content)

# if we check the 'ebert_reviews' folder, we should now see one text file per URL in the list above
# (88 files: the list skips some rankings, so there are fewer than 100)
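
A loop like this stops at the first failed request, and a failed request can also leave an error page saved under the review's file name. One possible way to make it more forgiving is sketched below. The names download_all and fake_fetch are hypothetical, and the fetch step is passed in as a function so the example runs without network access.

```python
import os
import tempfile

def download_all(urls, folder_name, fetch):
    """Save the bytes returned by fetch(url) for each URL; return (saved, failed)."""
    os.makedirs(folder_name, exist_ok=True)      # no separate exists() check needed
    saved, failed = [], []
    for review_url in urls:
        try:
            content = fetch(review_url)
        except Exception as error:               # record the failure and keep going
            failed.append((review_url, str(error)))
            continue
        path = os.path.join(folder_name, review_url.split('/')[-1])
        with open(path, mode='wb') as file:
            file.write(content)
        saved.append(path)
    return saved, failed

# With requests installed, a real fetcher could look like this (not run here):
# def fetch_with_requests(url):
#     response = requests.get(url, timeout=30)
#     response.raise_for_status()                # raises for 4xx/5xx status codes
#     return response.content

def fake_fetch(url):
    """Stand-in fetcher used only for this demonstration."""
    if url.endswith('missing.txt'):
        raise IOError('404 Not Found')
    return b'review text\n'

with tempfile.TemporaryDirectory() as tmp_dir:
    saved, failed = download_all(
        ['https://example.com/a/1-first.txt', 'https://example.com/a/missing.txt'],
        os.path.join(tmp_dir, 'ebert_reviews'),
        fake_fetch,
    )
    saved_names = [os.path.basename(path) for path in saved]

print(saved_names)
print(failed)
```

Returning the failed URLs makes it easy to retry just those instead of re-running the whole list.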

## Text Files in Python

#### Text file structure
- For a useful lesson on encoding, Unicode, ASCII etc., go to this link:
        http://kunststube.net/encoding/
        
- It spells out what encoding is & why it's important for a programmer!

#### Unicode & Python
In Python 3, there is:
- one text type: ***str*** which holds Unicode data
- two byte types: ***bytes*** & ***bytearray***
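
A short sketch of how the three types relate; the sample word is chosen only because it contains one non-ASCII character.

```python
text = 'café'                      # str: a sequence of Unicode characters
data = text.encode('utf-8')        # bytes: the encoded form, immutable

print(type(text).__name__, len(text))   # counts characters
print(type(data).__name__, len(data))   # counts bytes; 'é' takes two bytes in UTF-8

# Decoding with the same encoding gets the original text back
round_trip = data.decode('utf-8')

# Decoding with the wrong encoding fails loudly instead of silently corrupting the text
try:
    data.decode('ascii')
    ascii_failed = False
except UnicodeDecodeError:
    ascii_failed = True

# bytearray is the mutable byte type: individual bytes can be replaced in place
buffer = bytearray(data)
buffer[0] = ord('C')
capitalised = bytes(buffer).decode('utf-8')
print(capitalised)
```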

In [1]:
# we can use the glob library as a really handy way to specify which text files to import
# it lets us use wildcards/patterns to grab certain files
import glob

In [2]:
# run a test and print every file name 
for review in glob.glob('ebert_reviews/*.txt'):
    print(review)

ebert_reviews\1-the-wizard-of-oz-1939-film.txt
ebert_reviews\10-metropolis-1927-film.txt
ebert_reviews\100-battleship-potemkin.txt
ebert_reviews\11-e.t.-the-extra-terrestrial.txt
ebert_reviews\12-modern-times-film.txt
ebert_reviews\14-singin-in-the-rain.txt
ebert_reviews\15-boyhood-film.txt
ebert_reviews\16-casablanca-film.txt
ebert_reviews\17-moonlight-2016-film.txt
ebert_reviews\18-psycho-1960-film.txt
ebert_reviews\19-laura-1944-film.txt
ebert_reviews\2-citizen-kane.txt
ebert_reviews\20-nosferatu.txt
ebert_reviews\21-snow-white-and-the-seven-dwarfs-1937-film.txt
ebert_reviews\22-a-hard-day27s-night-film.txt
ebert_reviews\23-la-grande-illusion.txt
ebert_reviews\25-the-battle-of-algiers.txt
ebert_reviews\26-dunkirk-2017-film.txt
ebert_reviews\27-the-maltese-falcon-1941-film.txt
ebert_reviews\29-12-years-a-slave-film.txt
ebert_reviews\3-the-third-man.txt
ebert_reviews\30-gravity-2013-film.txt
ebert_reviews\31-sunset-boulevard-film.txt
ebert_reviews\32-king-kong-1933-film.txt
ebert_revi

In [5]:
# when we open a file & read the contents, we should always specify the encoding. It surfaces errors early & makes debugging quicker
# let's look at reading the first line only, of one text file. we'll use print & break to stop after 1 file
for review in glob.glob('ebert_reviews/*.txt'):
    with open(review, encoding='utf-8') as file:
        print(file.readline()[:-1])
        break      
# we'll see the first line of the Wizard of Oz film review printed, which is the movie title
# we use the [:-1] slice to drop the last character, the \n that marks the end of the line (it causes a blank line if not removed)

The Wizard of Oz (1939)
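
One caveat with the [:-1] slice is that it always removes the last character, whether or not that character is a newline. The sketch below compares it with rstrip('\n') on an in-memory file whose final line has no trailing newline; the URL is a placeholder, not one of the real review links.

```python
import io

# Two lines; the second one deliberately has no '\n' at the end
file = io.StringIO('The Wizard of Oz (1939)\nhttp://example.com/review')

title_sliced = file.readline()[:-1]   # drops the '\n', as intended
url_sliced = file.readline()[:-1]     # no '\n' here, so a real character is lost

file.seek(0)                          # go back to the start and read again
title_stripped = file.readline().rstrip('\n')
url_stripped = file.readline().rstrip('\n')   # only removes newlines, if there are any

print(url_sliced)
print(url_stripped)
```

For the review files the slice works because the first two lines both end in a newline; rstrip('\n') is simply the safer habit.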


In [9]:
# An efficient way to build a dataframe of all the info from these text files is to create an empty list,
# then build a dictionary from each text file during the iteration and append it to the list
# that list of dictionaries can then be converted to a pandas dataframe once all data is gathered
df_list = []
for review in glob.glob('ebert_reviews/*.txt'):
    with open(review, encoding='utf-8') as file:
        title = file.readline()[:-1]
        url = file.readline()[:-1]     # since url is on line 2, we can just use readline again to auto read the next line 
        review_text = file.read()     # will read all the remaining lines in the text file, aka The review
        df_list.append({'title': title,
                        'url': url,
                        'review_text': review_text
                       })

# convert to pandas dataframe 
import pandas as pd 
df = pd.DataFrame(df_list, columns = ['title', 'url', 'review_text'])
df.head() # check the top 5 lines

Unnamed: 0,title,url,review_text
0,The Wizard of Oz (1939),http://www.rogerebert.com/reviews/great-movie-...,As a child I simply did not notice whether a m...
1,Metropolis (1927),http://www.rogerebert.com/reviews/great-movie-...,The opening shots of the restored “Metropolis”...
2,Battleship Potemkin (1925),http://www.rogerebert.com/reviews/great-movie-...,"""The Battleship Potemkin” has been so famous f..."
3,E.T. The Extra-Terrestrial (1982),http://www.rogerebert.com/reviews/great-movie-...,Dear Raven and Emil:\n\nSunday we sat on the b...
4,Modern Times (1936),http://www.rogerebert.com/reviews/modern-times...,"A lot of movies are said to be timeless, but s..."
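
The rows above come out in file-name order (1, 10, 100, 11, ...) rather than ranking order, because the names are compared as text. If ranking order matters, one option is to sort on the number at the start of each file name. The sketch below is an illustration only: rank_of is a hypothetical helper, and the three small files are created in a temporary folder with placeholder URLs and review text.

```python
import glob
import os
import tempfile

def rank_of(path):
    """Return the leading ranking number from a name like '10-metropolis-1927-film.txt'."""
    return int(os.path.basename(path).split('-')[0])

sample_files = [
    ('1-the-wizard-of-oz-1939-film.txt', 'The Wizard of Oz (1939)'),
    ('10-metropolis-1927-film.txt', 'Metropolis (1927)'),
    ('2-citizen-kane.txt', 'Citizen Kane (1941)'),
]

with tempfile.TemporaryDirectory() as tmp_dir:
    # Create small stand-in review files: title, URL, then the review text
    for name, title in sample_files:
        with open(os.path.join(tmp_dir, name), mode='w', encoding='utf-8') as file:
            file.write(title + '\nhttp://example.com/placeholder\nReview text goes here.\n')

    records = []
    # key=rank_of sorts numerically (1, 2, 10) instead of alphabetically (1, 10, 2)
    for review in sorted(glob.glob(os.path.join(tmp_dir, '*.txt')), key=rank_of):
        with open(review, encoding='utf-8') as file:
            records.append({'ranking': rank_of(review),
                            'title': file.readline().rstrip('\n'),
                            'url': file.readline().rstrip('\n'),
                            'review_text': file.read()})

titles = [record['title'] for record in records]
print(titles)
```

Keeping the ranking as its own field also gives a convenient key for joining this data with the critic-score table later.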
