# Data Scraping (CS109 - Lec2)

More data exploration than scraping, but that's the title of the lecture, so we'll go with it...

Based on: https://github.com/cs109/2015/tree/master/Lectures

In [12]:
## all imports
from IPython.display import HTML
import numpy as np
from urllib import request # urllib2 does not exist in Python 3 - https://stackoverflow.com/questions/58794540/no-module-named-urllib2-how-do-i-use-it-in-python-so-i-can-make-a-request
import bs4 # this is Beautiful Soup
import time
import operator
import socket
import pickle # was cPickle in Python 2; Python 3 uses pickle
import re # regular expressions

from pandas import Series
import pandas as pd
from pandas import DataFrame

import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline

import seaborn as sns
sns.set_context("talk")
sns.set_style("white")

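from secrets import * # presumably a local secrets.py (e.g. API keys); note that this name shadows Python 3's standard-library secrets module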

For this lecture I am supposed to use:
* Rotten Tomatoes API (no longer public; $60k+ per annum! Someone has realised the value of their data...)
* Twitter API: https://apps.twitter.com/app/new (I think this will take some time to set up...)

But I'll hold off on these for now.

## MovieLens Data Analysis

Credits: https://grouplens.org/datasets/movielens/

Cool publicly available dataset!

### Pulling in the data:

In [18]:
u_cols = ['user_id', 'age', 'sex', 'occupation', 'zip_code']
# column names to assign to the imported file (passed via names=)
users = pd.read_csv('http://files.grouplens.org/datasets/movielens/ml-100k/u.user', sep='|', names=u_cols)
# sep is the delimiter used in the file
users.head()

   user_id  age sex  occupation  zip_code
0        1   24   M  technician     85711
1        2   53   F       other     94043
2        3   23   M      writer     32067
3        4   24   M  technician     43537
4        5   33   F       other     15213
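
To see what `sep` and `names` are doing without downloading anything, here is a minimal sketch that reads the same layout from an in-memory string. The three lines are rebuilt from the rows displayed above; `sample` and `sample_users` are names made up for this illustration.

```python
import io

import pandas as pd

# Three rows rebuilt from the users output, in a pipe-delimited layout
# with no header line (inlined so that no download is needed).
sample = io.StringIO(
    "1|24|M|technician|85711\n"
    "2|53|F|other|94043\n"
    "3|23|M|writer|32067\n"
)

u_cols = ['user_id', 'age', 'sex', 'occupation', 'zip_code']

# names= supplies the column labels; because it is given, pandas treats
# the first line as data rather than as a header.
sample_users = pd.read_csv(sample, sep='|', names=u_cols)
print(sample_users)
```

If `names` were left out, pandas would use the first data line as the header, which is rarely what you want for a file like this.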


In [19]:
r_cols = ['user_id', 'movie_id', 'rating', 'unix_timestamp']
ratings = pd.read_csv(
    'http://files.grouplens.org/datasets/movielens/ml-100k/u.data', 
    sep='\t', names=r_cols)
ratings.head()

   user_id  movie_id  rating  unix_timestamp
0      196       242       3       881250949
1      186       302       3       891717742
2       22       377       1       878887116
3      244        51       2       880606923
4      166       346       1       886397596
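
The `unix_timestamp` column is hard to read as raw integers. A small sketch of one way to make it readable, assuming the values are seconds since the Unix epoch (which the column name suggests); `stamps` and `rated_at` are illustrative names, not part of the lecture code.

```python
import pandas as pd

# Two timestamps copied from the ratings output above.
stamps = pd.Series([881250949, 891717742], name='unix_timestamp')

# unit='s' tells pandas the integers count seconds since 1970-01-01 UTC.
rated_at = pd.to_datetime(stamps, unit='s')
print(rated_at)
```

The same call could be applied to `ratings['unix_timestamp']` to add a datetime column to the full table.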


In [24]:
# the movies file contains columns indicating the movie's genres
# let's only load the first five columns of the file with usecols
m_cols = ['movie_id', 'title', 'release_date', 'video_release_date', 'imdb_url']

movies = pd.read_csv(
    'http://files.grouplens.org/datasets/movielens/ml-100k/u.item',
    sep='|', names=m_cols, usecols=range(5), encoding='latin')

movies.head()

   movie_id              title  release_date  video_release_date                                            imdb_url
0         1   Toy Story (1995)   01-Jan-1995                 NaN  http://us.imdb.com/M/title-exact?Toy%20Story%2...
1         2   GoldenEye (1995)   01-Jan-1995                 NaN  http://us.imdb.com/M/title-exact?GoldenEye%20(...
2         3  Four Rooms (1995)   01-Jan-1995                 NaN  http://us.imdb.com/M/title-exact?Four%20Rooms%...
3         4  Get Shorty (1995)   01-Jan-1995                 NaN  http://us.imdb.com/M/title-exact?Get%20Shorty%...
4         5     Copycat (1995)   01-Jan-1995                 NaN   http://us.imdb.com/M/title-exact?Copycat%20(1995)
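
A rough sketch of what `usecols` and `encoding` contribute, using a single made-up line shaped like the movies file: five leading fields followed by extra flag columns. The title, URL and flags below are invented for illustration and are not taken from the dataset.

```python
import io

import pandas as pd

# One hypothetical line: five fields we want, then three extra flag fields.
# "\xe9" is an accented e, which is a single byte in Latin-1.
raw = "1|Am\xe9lie (2001)|01-Jan-2001||http://example.com/amelie|0|1|0\n"
sample = io.BytesIO(raw.encode('latin-1'))

m_cols = ['movie_id', 'title', 'release_date', 'video_release_date', 'imdb_url']

# usecols=range(5) keeps only the first five fields, so the trailing flags
# are dropped; encoding tells pandas how to decode the bytes.
sample_movies = pd.read_csv(sample, sep='|', names=m_cols,
                            usecols=range(5), encoding='latin-1')
print(sample_movies)
```

The empty fourth field comes back as `NaN`, which matches the all-missing `video_release_date` column seen above.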


### Getting info about our data:

In [25]:
print(movies.dtypes)

movie_id                int64
title                  object
release_date           object
video_release_date    float64
imdb_url               object
dtype: object


In [26]:
print(movies.describe())

          movie_id  video_release_date
count  1682.000000                 0.0
mean    841.500000                 NaN
std     485.695893                 NaN
min       1.000000                 NaN
25%     421.250000                 NaN
50%     841.500000                 NaN
75%    1261.750000                 NaN
max    1682.000000                 NaN
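
Note that `describe()` only summarised the two numeric columns. A small sketch, on a hand-built three-row frame using titles from the output above, of how `include='all'` brings the text columns in as well; `sample_movies` is an illustrative name.

```python
import numpy as np
import pandas as pd

sample_movies = pd.DataFrame({
    'movie_id': [1, 2, 3],
    'title': ['Toy Story (1995)', 'GoldenEye (1995)', 'Four Rooms (1995)'],
    'video_release_date': [np.nan, np.nan, np.nan],
})

# Default: numeric columns only, so 'title' is left out.
numeric_summary = sample_movies.describe()

# include='all' also summarises non-numeric columns (count, unique, top, freq).
full_summary = sample_movies.describe(include='all')

print(numeric_summary.columns.tolist())
print(full_summary.columns.tolist())
```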


### Selecting Data:

* A DataFrame is a group of Series with a shared index
* A single column in a DataFrame is a Series

pandas is built around DataFrames - learn more here: https://www.geeksforgeeks.org/python-pandas-dataframe/
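
The two points above can be checked directly. A minimal sketch with hand-made data (the values mirror the first few users; `age`, `occupation` and `frame` are illustrative names):

```python
import pandas as pd

# Two Series that share the default index 0, 1, 2.
age = pd.Series([24, 53, 23], name='age')
occupation = pd.Series(['technician', 'other', 'writer'], name='occupation')

# Building a DataFrame from them lines the values up on that index.
frame = pd.DataFrame({'age': age, 'occupation': occupation})

# Pulling a column back out gives a Series carrying the frame's index.
column = frame['age']
print(type(column).__name__)
print(column.index.equals(frame.index))
```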

In [27]:
users.head()
users['occupation'].head() # selecting only the occupation column

## The output looks plainer here because a single column is a Series, which is shown as plain text rather than as a formatted table

0    technician
1         other
2        writer
3    technician
4         other
Name: occupation, dtype: object

In [28]:
columns_we_want = ['occupation', 'sex']
print(users[columns_we_want].head())

   occupation sex
0  technician   M
1       other   F
2      writer   M
3  technician   M
4       other   F
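
The type you get back depends on what goes inside the brackets. A short sketch on a hand-built frame (`sample_users` is an illustrative stand-in for `users`):

```python
import pandas as pd

sample_users = pd.DataFrame({
    'user_id': [1, 2, 3],
    'sex': ['M', 'F', 'M'],
    'occupation': ['technician', 'other', 'writer'],
})

# A single label returns a Series.
as_series = sample_users['occupation']

# A list of labels returns a DataFrame, even if the list has one entry.
as_frame = sample_users[['occupation']]

# Columns come back in the order given in the list.
two_cols = sample_users[['occupation', 'sex']]

print(type(as_series).__name__, type(as_frame).__name__)
print(two_cols.columns.tolist())
```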


In [31]:
print(users.iloc[3]) # iloc selects by integer position: position 3 is the fourth row

user_id                4
age                   24
sex                    M
occupation    technician
zip_code           43537
Name: 3, dtype: object
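
With the default index, position and label happen to coincide, which hides the difference between `iloc` and `loc`. A small sketch on a hand-built frame whose index labels are not 0, 1, 2 (the frame and its values are illustrative):

```python
import pandas as pd

sample_users = pd.DataFrame(
    {'user_id': [2, 5, 6], 'age': [53, 33, 42]},
    index=[1, 4, 5],  # non-contiguous labels, e.g. left over after filtering
)

# iloc counts rows from 0 regardless of their labels.
by_position = sample_users.iloc[1]

# loc looks the row up by its index label.
by_label = sample_users.loc[1]

print(by_position['user_id'], by_label['user_id'])
```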


### Filtering the Dataset

In [33]:
oldUsers = users[users.age > 25]
oldUsers.head()

   user_id  age sex     occupation  zip_code
1        2   53   F          other     94043
4        5   33   F          other     15213
5        6   42   M      executive     98101
6        7   57   M  administrator     91344
7        8   36   M  administrator      5201
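
Under the hood, the filter is a boolean Series used as a mask. A minimal sketch on five hand-typed rows taken from the users output (`sample_users`, `mask`, `older` and `young_men` are illustrative names):

```python
import pandas as pd

sample_users = pd.DataFrame({
    'user_id': [1, 2, 3, 4, 5],
    'age': [24, 53, 23, 24, 33],
    'sex': ['M', 'F', 'M', 'M', 'F'],
})

# The comparison yields one True/False per row.
mask = sample_users['age'] > 25

# Indexing with the mask keeps the True rows and their original labels.
older = sample_users[mask]

# Combine conditions with & (and) or | (or); each needs its own parentheses.
young_men = sample_users[(sample_users['age'] < 25) & (sample_users['sex'] == 'M')]

print(older.index.tolist())
print(young_men['user_id'].tolist())
```

The filtered frame keeps the original index labels, which is why the output above starts at 1 rather than 0; `reset_index(drop=True)` renumbers the rows if that is preferred.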
