<a href="https://colab.research.google.com/github/fsaldhaheri/rec3-demo/blob/main/PS4_fa2222.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

---
---
Problem Set 4: Web Scraping

Introduction to Computer Programming

New York University, Abu Dhabi

Out: 17th January 2022 || **Due: 20th January 2022 at 23:59**

---
---
#Start Here
## Learning Goals
### General Goals
- Learn the fundamental concepts of Web Scraping

### Specific Goals
- Learn how to identify websites that allow scraping
- Learn the basics of BeautifulSoup
- Learn to adjust the scraping code to more than one query

## Collaboration Policy
- You are expected to comply with the [University Policy on Academic Integrity and Plagiarism](https://www.nyu.edu/about/policies-guidelines-compliance/policies-and-guidelines/academic-integrity-for-students-at-nyu.html).
- You are allowed to talk with / work with other students on homework assignments.
- You can share ideas but not code, analyses or results; you must submit your own code and results. All submitted code will be compared against all code submitted this semester and online using MOSS. We will also critically analyze the similarities in the submitted reports, methodologies, and results, **but we will not police you**. We expect you all to be responsible enough to finish your work with full integrity.

## Late Submission Policy
You can submit the homework up to 2 late days. However, we will deduct 20% of your homework grade **for each late day you take**. We will not accept the homework after 2 late days.  Late submissions should be sent via email after the online submission is closed.

## Distribution of Class Materials
These problem sets and recitations are intellectual property of NYUAD, and we request the students to **not** distribute them or their solutions to other students who have not signed up for this class, and/or intend to sign up in the future. We also request you don't post these problem sets, and recitations online or on any public platforms.

## Disclaimer
The number of points do not necessarily signify/correlate to the difficulty level of the tasks.

## Submission
You will submit all your code as a Python Notebook, as well as the 2 CSV files obtained from Part II through [Brightspace](https://brightspace.nyu.edu/) as **P4_YOUR NETID.ipynb**.

---

# General Instructions


This homework is worth 100 points. It has 3 parts. Below each part, we provide a set of concepts required to complete that part. All the parts need to be completed in a Jupyter (Colab) Notebook attached with this handout. We recommend that you read the complete handout before starting the homework.

<font color="red">**Important Note:** Please scrape the websites ethically and make sure you utilize the `sleep` function between requests. </font>


# Part I: To Scrape or Not to Scrape (5 points)

A friend has a brilliant idea and wants us to write a program that will scrape product data from 3 different websites, and will recommend which one to buy based on price and reviews. However, we need to figure out which of the below websites actually allow scraping:

- [Amazon.ae](https://www.amazon.ae/)
- [Jumbo.ae](https://www.jumbo.ae/)
- [Instock.ae](https://www.instock.ae/)

Provide your answers, and, in a couple of sentences, describe how you have come to your answer.




Amazon.ae does not allow scraping for the English webpages with the exception of the wishlist creation page. A user needs to be a member in amazon.ae to access what' beyond these pages. However, amazon allows scraping the website in its Arabic version.

Allow: /wishlist/universal
Allow: /wishlist/vendor-button
Allow: /wishlist/get-button
Allow: /gp/wishlist/universal
Allow: /gp/wishlist/vendor-button
Allow: /gp/wishlist/ipad-install
Allow: /-/ar/

Jumbo.ae allows web scraping except for pages under /prodcuts/ and does not allow to access it's search engine. Moreover, it has a (Crawl-delay: 30) for user agents.

Disallow: /products/

Disallow: /search?q=*


Finally, Instock.ae does not allow web scraping on its website.


### *Concepts required to complete this task:*

- Concepts of Ethical Scraping

Basic common sense applies to web scraping just like it would apply when interacting with other humans. Simple manners, such as asking for permission and thanking the host page, go a long way. Below are some basic rules to follow written by by Jami from [Empirical Data](https://www.empiricaldata.org/dataladyblog/a-guide-to-ethical-web-scraping).

> 
1.   Just like knoking on the door of a house requesting permission to enter, a web scraper would ask for permission to access the website if not clearly stated.


2.   Just like introducing yourself to people who don't know you, a webscraper would give some information about their identity by their User-Agent string.

3.   Knowing the rules is a must in a civil society. Similarly, being aware of how terms and conditions,and robot.txt govern your behavior on a page reflects user prudence.

4.   Use the data that you need and not more. Use it with consideration without causing any financial harm to the source that provided you with this information.

5.   Finally in your research and final output. Thank and referece the source that provided you the information without excecting anything in return.

## Rubric

- +5 points for correct answer and pointing the place you found the information

# Part II: Hindi Geet Mala (50 points)

[Hindi Geet Mala](https://www.hindigeetmala.net/) is a website containing information about Indian movies, songs, singers, etc. For this part, you will scrape information about all the movies in **alphabetical order**. Additionally, you will scrape information about songs. 

More precisely, you will submit 2 CSV files:

* movies.csv: Title, Year, Number of Songs, Film Director, Film Producer, Film cast, Lyricist, Music Director, Singer, External links, Watch Full Movie

* songs.csv: Artists, Title, Rating, Number of Votes, Movie Title

*Note: you are NOT allowed to manually write the list of letters or use a list generator for the alphabet. The list should be retrieved from the page.*

In [27]:
import requests
import random
import numpy as np
import pandas as pd
import json
from bs4 import BeautifulSoup
import re
import string
import csv
import time

In [None]:
# code to scrape movies' number or ratings on hindigeetmala.net

header ={'User-Agent':'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.71 Safari/537.36'}


# scrape the homepage for years in which movies were produced
url = "https://www.hindigeetmala.net"

# Sending request and getting content
req = requests.get(url)
soup = BeautifulSoup(req.text, "html.parser")
years = soup.find_all('a', attrs={'class':'head1'})

# creating an empty dataframe that will store movie titles,
# year of prodcution, and number of ratings
movies = {'Title': [0], 'Year': [0], 'Number of Songs': [0],'Film Director': [0],'Film Producer': [0],
          'Film cast': [0],'Lyricist': [0],'Music Director': [0],'Singer': [0],'External links': [0],'Watch Full Movie': [0],}
movies_df = pd.DataFrame(data = movies)
movies_df = movies_df.drop(movies_df.index[0]).reset_index(drop=True)

songs = {'Title': [0], 'Year': [0], 'Number of Songs': [0],'Film Director': [0],'Film Producer': [0],
          'Film cast': [0],'Lyricist': [0],'Music Director': [0],'Singer': [0],'External links': [0],'Watch Full Movie': [0],}
songs_df = pd.DataFrame(data = movies)
songs_df = songs.drop(songs.index[0]).reset_index(drop=True)



# setting empty URL to preserve years
url_movies = set()

# loop and extract years from homepage
for i in years:
    if (i.get('href')[0:6] == '/movie') and (i.get('href')[-8:-5].isnumeric()):
        # year will be added as a string to homepage URL
        year = i.get('href').replace("/movie/","").replace(".php","").replace("_"," ")
        # creating a pandas series to add to the dataframe df
        year_df = pd.Series({"Year":year})
        # URL of each year
        for dummy in range(1,2):
            url_movies = url+str(i.get('href'))+'?page='+str(dummy)
            # print(year, i.get('href'), url_movies)

            # Sending request to get content and filter out movies' URL
            time.sleep(1)
            req = requests.get(url_movies)
            soup = BeautifulSoup(req.text, "html.parser")
            table = soup.find('table', attrs={'class':'b1 w760 alcen'})
            movies = table.find_all('a', attrs={'href':True})

            # empty string to avoid storing redundant listings of the same movie
            ''' how can we stop pages from repeating '''
            redundant_movie = ''

            # extracting movie names & URLs, and adding them to a pandas series
            for movie in movies:
                time.sleep(1)
                if (movie.get('href')[0:6] == '/movie') and (movie.get('href')[-3:]=='htm'):
                    if (movie.get('href') != redundant_movie):
                        redundant_movie = movie.get('href')
                        movie_name = movie.get('href').replace("/movie/","").replace(".htm","").replace("_"," ")
                        name_df = pd.Series({"Movie":movie_name})
                        #print(year_df, name_df, movie_name, movie.get('href'),movie['href'])

                        # on the movies page, we count the number of songs by adding up the rating count
                        url_final = "https://www.hindigeetmala.net"+str(movie['href'])
                        
                        # print(url_final)
                        time.sleep(1)
                        req = requests.get(url_final)
                        soup = BeautifulSoup(req.text, "html.parser")
                        table = soup.find('td', attrs={'class':'w760 vatop alcen'})
                        songs = table.find_all('span', attrs={'itemprop':'ratingCount'})

                        counter=0
                        for song in songs:
                            counter += 1
                        ratings = counter
                        ratings_df = pd.Series({"Ratings":ratings})

                        # append each row to the dataframe
                        new_row = {'Movie': movie_name, 'Year': year, 'Count': ratings}
                        df = df.append(new_row, ignore_index=True)
                        
# drop the first row
df = df.drop(df.index[0]).reset_index(drop=True)
print(df)

#the end

In [None]:
############# SOLUTION ###############
##### Solution start ########


# needed to go through the pages in alphabatecial order
# one more page is 0-9 which we add after the loop



'''
# loop through the pages in alphabetical order
for i in range(len(alpha)):
    url_alpha = "https://www.hindigeetmala.net/movie/"+str(alpha[i])+".php"
    response = requests.get(url_alpha)
    soup = BeautifulSoup(response.content, 'html.parser')
    
    # extract last 4 characters to get max number of pages per letter
    last_page = soup.title.string[-5:]
    # remove letters and spaces and leave a number
    last_page = int(last_page.translate({ord(i): None for i in 'of '}))
    
    #looping through each letter and all of the pages under it.
    for i in range(1, last_page+1):
        # taking above url and adding numbers
        url_number = url_alpha + '?page=' + str(i)
        
        # request and organize content of webpage 
        response = requests.get(url_number)
        soup = BeautifulSoup(response.content, 'html.parser')

        # extract the names of all images on page
        names = soup.find_all('img', alt=True)
        
        # remove names of non-movie images.
        namelist = []
        for name in names:
            if (name['alt'] != 'new') & (name['alt'] != 'Movie') & (name['alt'] != 'next page') & (name['alt'][-3:] != 'jpg'):
                namelist.append(name['alt'])
                print(namelist)

# later
# we are testing out namelist on a single page
# this is part of the for loop above but i am seperating it to avoid the time
# needed to run the loop
# 
url_number = "https://www.hindigeetmala.net/movie/a.php?page=3"
response = requests.get(url_number)
soup = BeautifulSoup(response.content, 'html.parser')
names = soup.find_all('img', alt=True)

namelist = []
for name in names:
    #print(name['alt'])
    if ((name['alt'] == 'new') | 
        (name['alt'] == 'Movie') | 
        (name['alt'] == 'next page') | 
        (name['alt'][-3:] == 'jpg')):
        pass
    elif (name['alt'][-1] == ')'):
        if (name['alt'][-6] == '('):
            pass
    else:
        namelist.append(name['alt'])
        print(namelist)

# extract the years of each movie to average but this number later
# next is to make a url with the name of the movies at the end attached
# loop through each movie page and gather all songs and count them.
# print a datafram with the average of each year and the year next to it
############# SOLUTION END ###############

'''

url = "https://www.hindigeetmala.net"
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')
alpha = soup.find_all('a', attrs={'class':'head1'})

for a in alpha:

    # movies urls and therefore information starts here
    if 'movie' in a['href']:
        # condition to avoid any string with a number
        # we add 0-9 manually
        if (not any(i.isdigit() for i in a['href'])):
            print('movie string: ', a['href'])
            
            url_alpha = "https://www.hindigeetmala.net"+str(a['href'])
            response = requests.get(url_alpha)
            soup = BeautifulSoup(response.content, 'html.parser')
            
            # extract last 4 characters to get max number of pages per letter
            last_page = soup.title.string[-5:]
            # remove letters and spaces and leave a number
            last_page = int(last_page.translate({ord(i): None for i in 'of '}))
            
            #looping through each letter and all of the pages under it.
            for i in range(1, last_page+1):
                # taking above url and adding numbers
                url_number = url_alpha + '?page=' + str(i)
                
                # request and organize content of webpage 
                response = requests.get(url_number)
                soup = BeautifulSoup(response.content, 'html.parser')

                # extract the names of all images on page
                names = soup.find_all('img', alt=True)
                
                # remove names of non-movie images.
                namelist = []
                for name in names:
                    if (name['alt'] != 'new') & (name['alt'] != 'Movie') & (name['alt'] != 'next page') & (name['alt'][-3:] != 'jpg'):
                        namelist.append(name['alt'])
                        print(namelist)


    # songs urls and therefore information starts here
    elif 'index' not in a['href']:
        print('singer string: ', a['href'])

        '''
        movies_df = pd.Series({})

        if c.string != 'Category':
            txt = c.string
            cat_title = txt.split('/movies')[0]
            freq = txt.split(' (')[1][:-1]
            df2 = pd.Series({"Category":cat_title,"Frequency":freq})
            df = df.append(df2, ignore_index=True)

        df = pd.DataFrame(columns=["Category","Frequency"])
        '''

### STEP 1: Finding URL alphabets for movies and songs 

In [None]:

url = "https://www.hindigeetmala.net"
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')
alpha = soup.find_all('a', attrs={'class':'head1'})

for a in alpha:
    # movies urls and therefore information starts here
    if 'movie' in a['href']:
        # condition to avoid any string with a number
        # we add 0-9 manually
        if (not any(i.isdigit() for i in a['href'])):
            print('movie string:', a['href'])
    # songs urls and therefore information starts here
    else:
        if 'index' not in a['href']:
            print('singer string: ', a['href'])

We found all letters. We move to numbers

In [None]:
url_alpha = "https://www.hindigeetmala.net/movie/a.php"
response = requests.get(url_alpha)
soup = BeautifulSoup(response.content, 'html.parser')

# extract last 4 characters to get max number of pages per letter
last_page = soup.title.string[-5:]
# remove letters and spaces and leave a number
last_page = int(last_page.translate({ord(i): None for i in 'of '}))
      

Now, we can loop through all pages under a letter

In [None]:
# this code loops through numbers of each letter 
for i in range(1, last_page+1):
    print(i)

### STEP 2: obtaining URL movie names

The snippets above and below are together but not to call the website everytime.

In [None]:
for name in names:
    if ((name['alt'] != 'new') & (name['alt'] != 'Movie') & (name['alt'] != 'next page') & (name['alt'][-3:] != 'jpg')):
        print(name['alt'])

I have the names. I need to append them to a series 

1.   remove year
2.   add underscore
3.   create URL

In [None]:
#  remove year

#   add underscore

#   create movie page URL


### Step 3: Extract info from movie page

In [None]:

parsed = json.loads(response.text)
print(json.dumps(parsed, indent=4, sort_keys=True)) #this code prettifies the content for you for readability by indenting it

Scrape the whole page and view soup

<title>
   A - Alphabetically List of Hindi Films - Starting from A - Page 1 of 50
  </title>

  

In [None]:
url_number = "https://www.hindigeetmala.net/movie/a.php?page=3"
response = requests.get(url_number)
soup = BeautifulSoup(response.content, 'html.parser')

In [None]:
# we can either use find_all and then find
table = soup.find_all('table', attrs={'class':'b1 w760 alcen'})
names = table[0].find('img', attrs={'alt':True})

In [None]:
# or we find all and then loop through each item
names = table.find_all('img', attrs={'alt':True})
for item in names:
    item['alt']

In [None]:
# this is a second way where we start with find then find all
table = soup.find('table', attrs={'class':'b1 w760 alcen'})
names = table.find_all('img', attrs={'alt':True})
print(names)

In [None]:
# taking above url and adding numbers
url = "https://www.hindigeetmala.net/movie/aakhree_raasta.htm"
    
# request and organize content of webpage 
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')
#print(soup.prettify())  # We call prettify for easier viewing

# movie title
movie_title = soup.title.string.split(' :')[0]
# movie year
movie_year = soup.title.string[-5:-1]

# print(soup.title)
actors = soup.find_all("span", itemprop="name")
#actors = soup.find_all("td", itemprop="actor")
ratings = soup.find_all("span", itemprop="ratingValue")

ratingcount = soup.find_all("span", itemprop="ratingCount")

counter=0
for count in ratingcount:
    print(count.string)
    counter+=1

# number of songs
number_of_songs = counter

for rate in ratings:
    print(rate.string)

22
6
3
12
4
1
6
3.41
4.67
4.67
3.50
4.75
1.00


Trying to extract actors

In [None]:
for actor in actors:
    # while (actor.string != name):
    # Each of the elements in this list is a span object
    print(actor.string)

 Gori Ka Sajan Sajan Ki Gori 
Aakhree Raasta
 Gori Kaa Saajan 
Aakhree Raasta
 Hey You Pahale Padhaai Phir Pyaar Hogaa 
Aakhree Raasta
 Tune Mera Dudh Piya Hai Tu Bilkul Mere Jaisa Hai 
Aakhree Raasta
 Tera Dudh Aisa Hai 
Aakhree Raasta
 Dance Music (Aakhree Raasta) 
Aakhree Raasta
Aakhree Raasta
Amitabh Bachchan
Jaya Prada
Sridevi
Anupam Kher
Sadashiv Amrapurkar
Om Shiv Puri
Dalip Tahil
Bharat Kapoor
Viju Khote
Gurbachan
Jagdarshan
Umesh Sharma
Vijay Kumar
Anand
Ashok Kumar
Kiran
Hamid
Dilip
Laxmikant  Pyarelal
K Bhagyaraj
Purnachandra Rao


Start extracting information we select

In [None]:
# extract the names of all images on page
names = soup.find_all('meta', alt=True)
print(names[2])

In [None]:



endpoint = "http://www.thecocktaildb.com/api/json/v1/1/search.php?s=margarita"
response = requests.get(endpoint)
response.text
# To make this more readable, let's use the json library
import json

parsed = json.loads(response.text)
print(json.dumps(parsed, indent=4, sort_keys=True)) #this code prettifies the content for you for readability by indenting it

### *Concepts required to complete this task:*

- Navigating through HTML code using functions
- DataFrame Creation and Writing to a file


## Rubric

- +30 points for correct output (15 points for each dataframe)
- +15 points for concise, logical code and strategic scraping
- +3 points for ethical and mindful scraping
- +2 points for comments and variable names

# Part III: Yahoo Finance (45 points)

Write a function called `stock_decision()` that will take as an input the list of company abbreviations (e.g. the URL for Facebook is https://finance.yahoo.com/quote/FB/history/, so this company's abbreviation is "FB"), and will scrape the stock market close/last data for the given companies from Yahoo Finance, compute the weekly average for the past 3 weeks and check whether or not it has been increasing. If the weekly average has been steadily increasing, recommend to buy, otherwise recommend to sell.

The output should be two lines: 

```
"Sell: "

"Buy: "
```

When you implement the function, call it with these companies: 
- Apple
- Microsoft
- Amazon
- Tesla
- Facebook 





*Hint: For grouping the data per week, look up `resample` function from Pandas. Another function that will come in handy is `diff` from Pandas.*

In [28]:
############# SOLUTION ###############

header = {'User-Agent':'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.71 Safari/537.36'}
headers = header
url = 'https://finance.yahoo.com/quote/ASPL.L/history'

# time.sleep(1)
req = requests.get(url)
print(req.status_code)
print(req.text)
soup = BeautifulSoup(req.text, "html.parser")
table = soup.find('table', attrs={'class':'W(100%) M(0)'})
price = table.find_all('td', attrs={'class':True})

print(price)

############# SOLUTION END ###############

404
<!DOCTYPE html>
  <html lang="en-us"><head>
  <meta http-equiv="content-type" content="text/html; charset=UTF-8">
      <meta charset="utf-8">
      <title>Yahoo</title>
      <meta name="viewport" content="width=device-width,initial-scale=1,minimal-ui">
      <meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1">
      <style>
  html {
      height: 100%;
  }
  body {
      background: #fafafc url(https://s.yimg.com/nn/img/sad-panda-201402200631.png) 50% 50%;
      background-size: cover;
      height: 100%;
      text-align: center;
      font: 300 18px "helvetica neue", helvetica, verdana, tahoma, arial, sans-serif;
  }
  table {
      height: 100%;
      width: 100%;
      table-layout: fixed;
      border-collapse: collapse;
      border-spacing: 0;
      border: none;
  }
  h1 {
      font-size: 42px;
      font-weight: 400;
      color: #400090;
  }
  p {
      color: #1A1A1A;
  }
  #message-1 {
      font-weight: bold;
      margin: 0;
  }
  #message-2 {
      displ

AttributeError: ignored

### *Concepts required to complete this task:*

- Navigating through HTML code using functions
- DataFrame Creation and Manipulations
- Applying functions to a DataFrame


## Rubric

- +25 points for scraping the correct information
- +10 points for concise and logical code
- +5 points for correctly and efficiently calculating the averages, percent change
- +3 points for ethical and mindful scraping
- +2 points for comments and logical variable names