# ADS 509 Module 1: APIs and Web Scraping

This notebook has three parts. In the first part you will pull data from the Twitter API. In the second, you will scrape lyrics from AZLyrics.com. In the last part, you'll run code that verifies the completeness of your data pull. 

For this assignment you have chosen two musical artists who have at least 100,000 Twitter followers and 20 songs with lyrics on AZLyrics.com. In this part of the assignment we pull the some of the user information for the followers of your artist and store them in text files. 


## Important Note

This assignment requires you to have a version of Tweepy that is at least version 4. The latest version is 4.10 as I write this. Critically, this version of Tweepy is *not* on the upgrade path from Version 3, so you will not be able to simply upgrade the package if you are on Version 3. Instead you will need to explicitly install version 4, which you can do with a command like this: `pip install "tweepy>=4"`. You will also be using Version 2 of the Twitter API for this assignment. 

Run the below cell. If your version of Tweepy begins with a "4", then you should be good to go. If it begins with a "3" then run the following command, found [here](https://stackoverflow.com/questions/5226311/installing-specific-package-version-with-pip), at the command line or in a cell: `pip install -Iv tweepy==4.9`. (You may want to update that version number if Tweepy has moved on past 4.9. 

In [1]:
# let's install the appropriate version of tweepy
!pip install "tweepy>=4"



In [2]:
# verify tweepy version
!pip show tweepy

Name: tweepy
Version: 4.10.1
Summary: Twitter library for Python
Home-page: https://www.tweepy.org/
Author: Joshua Roesslein
Author-email: tweepy@googlegroups.com
License: MIT
Location: /opt/anaconda3/lib/python3.8/site-packages
Requires: requests, oauthlib, requests-oauthlib
Required-by: 


# Twitter API Pull

In [3]:
# for the twitter section
import tweepy
import os
import datetime
import re
from pprint import pprint

# for the lyrics scrape section
import requests
import time
from bs4 import BeautifulSoup
from collections import defaultdict, Counter


In [4]:
# Use this cell for any import statements you add
import random
import pandas as pd

We need bring in our API keys. Since API keys should be kept secret, we'll keep them in a file called `api_keys.py`. This file should be stored in the directory where you store this notebook. The example file is provided for you on Blackboard. The example has API keys that are _not_ functional, so you'll need to get Twitter credentials and replace the placeholder keys. 

In [5]:
# contents of api_keys.py will be hidden from repository to maintain secrecy
# api_keys.py was created manually with API keys copied over
from api_keys import api_key, api_key_secret, bearer_token

In [6]:
client = tweepy.Client(bearer_token,wait_on_rate_limit=True)

# Testing the API

The Twitter APIs are quite rich. Let's play around with some of the features before we dive into this section of the assignment. For our testing, it's convenient to have a small data set to play with. We will seed the code with the handle of John Chandler, one of the instructors in this course. His handle is `@37chandler`. Feel free to use a different handle if you would like to look at someone else's data. 

We will write code to explore a few aspects of the API: 

1. Pull some of the followers @37chandler.
1. Explore response data, which gives us information about Twitter users. 
1. Pull the last few tweets by @37chandler.


In [7]:
# this identifies the twitter handle we'll be basing our pulls from
handle = "37chandler"
user_obj = client.get_user(username=handle)

# see https://docs.tweepy.org/en/v4.0.1/client.html
# client.get_users_followers retrieves follower data of specified handle
followers = client.get_users_followers(
    # Learn about user fields here: 
    # https://developer.twitter.com/en/docs/twitter-api/data-dictionary/object-model/user
    user_obj.data.id, user_fields=["created_at","description","location",
                                   "public_metrics"]
)



Rate limit exceeded. Sleeping for 405 seconds.


Now let's explore these a bit. We'll start by printing out names, locations, following count, and followers count for these users. 

In [8]:
num_to_print = 20

for idx, user in enumerate(followers.data) :
    # let's separate the public_metrics into following and follower counts
    following_count = user.public_metrics['following_count']
    followers_count = user.public_metrics['followers_count']
    
    print(f"{user.name} lists '{user.location}' as their location.")
    print(f" Following: {following_count}, Followers: {followers_count}.")
    print()
    
    if idx >= (num_to_print - 1) :
        break
    

Dave Renn lists 'None' as their location.
 Following: 44, Followers: 10.

Lionel lists 'None' as their location.
 Following: 202, Followers: 204.

Megan Randall lists 'None' as their location.
 Following: 141, Followers: 100.

Jacob Salzman lists 'None' as their location.
 Following: 562, Followers: 134.

twiter not fun lists 'None' as their location.
 Following: 221, Followers: 21.

Hariettwilsonincarnate lists 'None' as their location.
 Following: 219, Followers: 61.

Christian Tinsley lists 'None' as their location.
 Following: 2, Followers: 0.

Steve lists 'I'm over here.' as their location.
 Following: 1591, Followers: 33.

John O'Connor 🇺🇦 lists 'None' as their location.
 Following: 8, Followers: 1.

CodeGrade lists 'Amsterdam' as their location.
 Following: 2819, Followers: 425.

Cleverhood lists 'Providence, RI' as their location.
 Following: 2795, Followers: 3561.

Regina 🚶‍♀️🚲🌳 lists 'Minneapolis' as their location.
 Following: 2801, Followers: 3339.

Eric Hallstrom lists 'Mi

Let's find the person who follows this handle who has the most followers. 

In [9]:
max_followers = 0

for idx, user in enumerate(followers.data) :
    followers_count = user.public_metrics['followers_count']
    
    if followers_count > max_followers :
        max_followers = followers_count
        max_follower_user = user

        
print(max_follower_user)
print(max_follower_user.public_metrics)

WedgeLIVE
{'followers_count': 14197, 'following_count': 2221, 'tweet_count': 56123, 'listed_count': 218}


Let's pull some more user fields and take a look at them. The fields can be specified in the `user_fields` argument. 

In [10]:
response = client.get_user(id=user_obj.data.id,
                          user_fields=["created_at","description","location",
                                       "entities","name","pinned_tweet_id","profile_image_url",
                                       "verified","public_metrics"])

In [11]:
for field, value in response.data.items() :
    print(f"for {field} we have {value}")

for description we have He/Him. Data scientist, urban cyclist, educator, erstwhile frisbee player. 

¯\_(ツ)_/¯
for username we have 37chandler
for verified we have False
for location we have MN
for name we have John Chandler
for id we have 33029025
for created_at we have 2009-04-18 22:08:22+00:00
for profile_image_url we have https://pbs.twimg.com/profile_images/2680483898/b30ae76f909352dbae5e371fb1c27454_normal.png
for public_metrics we have {'followers_count': 192, 'following_count': 589, 'tweet_count': 997, 'listed_count': 3}


Now a few questions for you about the user object.

Q: How many fields are being returned in the `response` object? 

A: We have nine fields being returned in the `response` object. One of the fields ("public_metrics") returns four subfields.

---

Q: Are any of the fields within the user object non-scalar? (I.e., more complicated than a simple data type like integer, float, string, boolean, etc.) 

A: Yes, one of the fields ("profile_image_url") returns a url for a profile image.

---

Q: How many friends, followers, and tweets does this user have? 

A: This user has 589 friends, 192 followers, and 997 tweets.


Although you won't need it for this assignment, individual tweets can be a rich source of text-based data. To illustrate the concepts, let's look at the last few tweets for this user. You are encouraged to explore the fields that are available about Tweets.

In [12]:
# see https://docs.tweepy.org/en/v4.0.1/client.html
# this retrieves tweets from specified handle (user_obj.data.id)
response = client.get_users_tweets(user_obj.data.id)

# By default, only the ID and text fields of each Tweet will be returned
for idx, tweet in enumerate(response.data) :
    print(tweet.id)
    print(tweet.text)
    print()
    
    # the following indicates that when the idx iteration is greater than 10 -> stop
    if idx > 10 :
        break

1569760631548690437
RT @dtmooreeditor: So there's a particular quirk of English grammar that I've always found quite endearing: the exocentric verb-noun compou…

1569155273742327811
As a Minneapolis person, I knew we had Toronto beat, but I didn't realize Portland had us beat: https://t.co/xrx5mOFcWK.

But @nytimes, c'mon! https://t.co/M9mBWhdgsj

1568982292923826176
RT @wonderofscience: Amazing lenticular cloud over Mount Fuji

Credit: Iurie Belegurschi
https://t.co/0mUxl28H9U

1568242374085869570
RT @depthsofwiki: lots of memes about speedy wikipedia editors — quick thread about what went down on wikipedia in the minutes after her de…

1568074978754703361
@DrLaurenWilson @leighradwood @MaritsaGeorgiou @Walgreens I could not possibly agree more with this sentiment. Compared to almost any other primary care I've received, they are great.

1567530169686196224
@DrLaurenWilson @MaritsaGeorgiou @Walgreens For those who have access to Curry Health Center on campus, you can get a bivalent bo

## Pulling Follower Information

In this next section of the assignment, we will pull information about the followers of your two artists. We've seen above how to pull a set of followers using `client.get_users_followers`. This function has a parameter, `max_results`, that we can use to change the number of followers that we pull. Unfortunately, we can only pull 1000 followers at a time, which means we will need to handle the _pagination_ of our results. 

The return object has the `.data` field, where the results will be found. It also has `.meta`, which we use to select the next "page" in the results using the `next_token` result. I will illustrate the ideas using our user from above. 


### Rate Limiting

Twitter limits the rates at which we can pull data, as detailed in [this guide](https://developer.twitter.com/en/docs/twitter-api/rate-limits). We can make 15 user requests per 15 minutes, meaning that we can pull $4 \cdot 15 \cdot 1000 = 60000$ users per hour. I illustrate the handling of rate limiting below, though whether or not you hit that part of the code depends on your value of `handle`.  


In the below example, I'll pull all the followers, 25 at a time. (We're using 25 to illustrate the idea; when you do this set the value to 1000.) 

In [13]:
handle_followers = []
pulls = 0
max_pulls = 100
next_token = None

while True :

    followers = client.get_users_followers(
        user_obj.data.id, 
        max_results=25, # when you do this for real, set this to 1000!
        pagination_token = next_token,
        user_fields=["created_at","description","location",
                     "entities","name","pinned_tweet_id","profile_image_url",
                     "verified","public_metrics"]
    )
    pulls += 1
    
    for follower in followers.data : 
        follower_row = (follower.id,follower.name,follower.created_at,follower.description)
        handle_followers.append(follower_row)
    
    if 'next_token' in followers.meta and pulls < max_pulls :
        next_token = followers.meta['next_token']
    else : 
        break
        



## Pulling Twitter Data for Your Artists

Now let's take a look at your artists and see how long it is going to take to pull all their followers. 

In [14]:
artists = dict()

# this is a for loop to cycle through both MCR and Missy Elliot's twitter handles
for handle in ['MCRofficial','MissyElliott'] : 
    # client.get_user will get info about the handles
    # in this case, we're indicating the username to be the handles specified
    # the fields we'll retrieve will be info in public_metrics
    user_obj = client.get_user(username=handle,user_fields=["public_metrics"])
    # this specifies for an artist with a specific handle to generate their id, handle, and follower count
    artists[handle] = (user_obj.data.id, 
                       handle,
                       user_obj.data.public_metrics['followers_count'])
    
# this generates a print statement using info from artists
for artist, data in artists.items() : 
    print(f"It would take {data[2]/(1000*15*4):.2f} hours to pull all {data[2]} followers for {artist}. ")
    

It would take 24.61 hours to pull all 1476645 followers for MCRofficial. 
It would take 117.18 hours to pull all 7030860 followers for MissyElliott. 


Depending on what you see in the display above, you may want to limit how many followers you pull. It'd be great to get at least 200,000 per artist. 

As we pull data for each artist we will write their data to a folder called "twitter", so we will make that folder if needed.

In [15]:
# Make the "twitter" folder here. If you'd like to practice your programming, add functionality 
# that checks to see if the folder exists. If it does, then "unlink" it. Then create a new one.

if not os.path.isdir("twitter") : 
    #shutil.rmtree("twitter/")
    os.mkdir("twitter")

In this following cells, build on the above code to pull some of the followers and their data for your two artists. As you pull the data, write the follower ids to a file called `[artist name]_followers.txt` in the "twitter" folder. For instance, for Cher I would create a file named `cher_followers.txt`. As you pull the data, also store it in an object like a list or a data frame.

In addition to creating a file that only has follower IDs in it, you will create a file that includes user data. From the response object please extract and store the following fields: 

* screen_name	
* name	
* id	
* location	
* followers_count	
* friends_count	
* description

Store the fields with one user per row in a tab-delimited text file with the name `[artist name]_follower_data.txt`. For instance, for Cher I would create a file named `cher_follower_data.txt`. 

One note: the user's description can have tabs or returns in it, so make sure to clean those out of the description before writing them to the file. I've included some example code to do that below the stub. 

### Artist 1: My Chemical Romance (@MCRofficial)

In [16]:
# note the artists' twitter handles
handles = ['MCRofficial']
handle_followers = []

whitespace_pattern = re.compile(r"\s+")

user_data = dict() 
followers_data = dict()

for handle in handles :
    user_obj = client.get_user(username=handle) # this will specify the handles
    user_data[handle] = [] # will be a list of lists
    followers_data[handle] = [] # will be a simple list of IDs


# Grabs the time when we start making requests to the API
start_time = datetime.datetime.now()

for handle in handles :
    
    # Create the output file names 
    
    followers_output_file = handle + "_followers.txt"
    user_data_output_file = handle + "_follower_data.txt"
    
    # this produces a print statement indicating artist
    print(f'Pulling followers for {handle}.')
    
    # Using tweepy.Paginator (https://docs.tweepy.org/en/latest/v2_pagination.html), 
    # use `get_users_followers` to pull the follower data requested. 
    
    for followers in tweepy.Paginator(client.get_users_followers,
            user_obj.data.id,
            user_fields=["username","name","id","location","public_metrics","description"],
            max_results=1000,
            limit=100):
        print(followers.meta)
        
        for follower in followers.data:
            follower_row = {'id': follower.id,
                            'username': follower.username,
                            'name': follower.name,
                            'location': follower.location,
                            'follower_count': follower.public_metrics['followers_count'],
                            'following': follower.public_metrics['following_count'],
                            'description': follower.description
                           }
            user_data[handle].append(follower_row) # this appends all follower data
            followers_data[handle].append(follower.id) # this appends only follower ids
            
    # For each response object, extract the needed fields and store them in a dictionary or
    # data frame. 
    users_output_MCR = pd.DataFrame(user_data[handle], columns=['id','username',
                                                               'name', 'location',
                                                               'follower_count', 'following',
                                                               'description'])
    followers_output_MCR = pd.DataFrame(user_data[handle], columns=['id'])
    
    users_output_MCR.to_csv(f'twitter/{handle}_followers.txt', sep='\t', index=False)
    followers_output_MCR.to_csv(f'twitter/{handle}_follower_data.txt', sep='\t', index=False)
            
    print(f'Completed pulling follower data for {handle}.')
        
# Let's see how long it took to grab all follower IDs
end_time = datetime.datetime.now()
print(f'Time taken was {end_time - start_time}.')


Pulling followers for MCRofficial.
{'result_count': 1000, 'next_token': 'SL2BRT4RSOP1GZZZ'}
{'result_count': 1000, 'next_token': 'RL5MO38R04P1GZZZ', 'previous_token': '7MO6UELJ376UEZZZ'}
{'result_count': 1000, 'next_token': 'ERVPBVVABSOHGZZZ', 'previous_token': 'SI7NBTPRVV6UEZZZ'}
{'result_count': 1000, 'next_token': 'G4PC5CJAT8O1GZZZ', 'previous_token': 'BSGN3CPKK37EEZZZ'}
{'result_count': 1000, 'next_token': 'Q9HJUREOF0O1GZZZ', 'previous_token': 'NNG3ALCU2N7UEZZZ'}


Rate limit exceeded. Sleeping for 896 seconds.


{'result_count': 1000, 'next_token': 'PPJVFNQLSONHGZZZ', 'previous_token': '78G30DHUGV7UEZZZ'}
{'result_count': 1000, 'next_token': 'ET64BBB63CNHGZZZ', 'previous_token': 'MCQKDIN6378EEZZZ'}
{'result_count': 1000, 'next_token': 'VQ3CLCG0E8N1GZZZ', 'previous_token': 'ANOSE65OSJ8EEZZZ'}
{'result_count': 1000, 'next_token': '0CJGSCDGP8MHGZZZ', 'previous_token': 'HC61SB39HR8UEZZZ'}
{'result_count': 1000, 'next_token': '114EQD7F1CMHGZZZ', 'previous_token': 'ICR34TJ96N9EEZZZ'}
{'result_count': 1000, 'next_token': 'QPP6HU3CHOM1GZZZ', 'previous_token': '7RP16C1FUJ9EEZZZ'}
{'result_count': 1000, 'next_token': 'AIHK9NQT2KM1GZZZ', 'previous_token': 'C13IHOTHE79UEZZZ'}
{'result_count': 1000, 'next_token': 'UNEQL281BCLHGZZZ', 'previous_token': 'JRICTIT3TB9UEZZZ'}
{'result_count': 1000, 'next_token': 'VB7PQ2IQ94L1GZZZ', 'previous_token': 'HP58STCCKNAEEZZZ'}
{'result_count': 1000, 'next_token': 'CHF0OQTOSSK1GZZZ', 'previous_token': 'L3M7COLVMRAUEZZZ'}
{'result_count': 1000, 'next_token': 'T4PA6D2RQGJH

Rate limit exceeded. Sleeping for 893 seconds.


{'result_count': 1000, 'next_token': '72R0RI3MO4GHGZZZ', 'previous_token': 'C3MD6JQP6FEUEZZZ'}
{'result_count': 1000, 'next_token': 'TORE7D0TLSG1GZZZ', 'previous_token': 'MV139T8U7VFEEZZZ'}
{'result_count': 1000, 'next_token': 'CLV83VEBL8FHGZZZ', 'previous_token': '2URM95KIA7FUEZZZ'}
{'result_count': 1000, 'next_token': 'GJTQ2598IOF1GZZZ', 'previous_token': 'IKFBT026ANGEEZZZ'}
{'result_count': 1000, 'next_token': '4G5HFVJOAOEHGZZZ', 'previous_token': 'BO7PUG70D7GUEZZZ'}
{'result_count': 1000, 'next_token': 'VBVTJQ2924E1GZZZ', 'previous_token': 'NFI2IUKCL7HEEZZZ'}
{'result_count': 1000, 'next_token': '3VDF29FBQOD1GZZZ', 'previous_token': 'P6VJK3ULTRHUEZZZ'}
{'result_count': 1000, 'next_token': '1B7PGTOQPOCHGZZZ', 'previous_token': 'NUS44UQM57IUEZZZ'}
{'result_count': 1000, 'next_token': 'HR7GU279O4C1GZZZ', 'previous_token': 'NN185F896BJEEZZZ'}
{'result_count': 999, 'next_token': 'CJF26SHRLOBHGZZZ', 'previous_token': 'NC40BJH97RJUEZZZ'}
{'result_count': 1000, 'next_token': 'SMV6QSV8H8B1G

Rate limit exceeded. Sleeping for 892 seconds.


{'result_count': 1000, 'next_token': '981U50VA9O9HGZZZ', 'previous_token': 'EO6I6K4V1FMEEZZZ'}
{'result_count': 1000, 'next_token': 'KMHOFHKBJG91GZZZ', 'previous_token': 'FAG6TDGPM7MEEZZZ'}
{'result_count': 1000, 'next_token': 'TA0BGS330891GZZZ', 'previous_token': '5OVJ02BOCFMUEZZZ'}
{'result_count': 1000, 'next_token': 'TMNUQV5KH88HGZZZ', 'previous_token': 'KRQTHS54VNMUEZZZ'}
{'result_count': 1000, 'next_token': 'GH8T13OG0O8HGZZZ', 'previous_token': 'RDQ6P3KIENNEEZZZ'}
{'result_count': 1000, 'next_token': 'VKJ5S8KPF081GZZZ', 'previous_token': 'IFRC0AVJV7NEEZZZ'}
{'result_count': 1000, 'next_token': 'SRS057OG4C81GZZZ', 'previous_token': 'BRDE3MCHGVNUEZZZ'}
{'result_count': 1000, 'next_token': 'G77F72SAOO7HGZZZ', 'previous_token': 'O0PAP67PRJNUEZZZ'}
{'result_count': 1000, 'next_token': 'KAKC7N49C47HGZZZ', 'previous_token': 'BIO5EI4877OEEZZZ'}
{'result_count': 1000, 'next_token': 'K591EAP13O7HGZZZ', 'previous_token': '42692ADDJROEEZZZ'}
{'result_count': 1000, 'next_token': '9TMKM927NK71

Rate limit exceeded. Sleeping for 892 seconds.


{'result_count': 1000, 'next_token': 'E03AA3V1LS61GZZZ', 'previous_token': '3NEAEMN71NPUEZZZ'}
{'result_count': 1000, 'next_token': 'JPBBOIQNB461GZZZ', 'previous_token': 'RNDS1JICA3PUEZZZ'}
{'result_count': 1000, 'next_token': 'KVKIGSO06061GZZZ', 'previous_token': 'FMSKARDBKRPUEZZZ'}
{'result_count': 1000, 'next_token': '2QCLHAJQ1461GZZZ', 'previous_token': 'UQC5N302Q3PUEZZZ'}
{'result_count': 1000, 'next_token': '7GVJJ0V9UK5HGZZZ', 'previous_token': 'OPLQ5R47URPUEZZZ'}
{'result_count': 1000, 'next_token': 'HFT371T7T45HGZZZ', 'previous_token': 'O58K8SOR1BQEEZZZ'}
{'result_count': 1000, 'next_token': '0H5LAV55RS5HGZZZ', 'previous_token': '7895BFJ12RQEEZZZ'}
{'result_count': 1000, 'next_token': 'VQHG73EOQS5HGZZZ', 'previous_token': '7UN983QR43QEEZZZ'}
{'result_count': 1000, 'next_token': '792PDCAEPS5HGZZZ', 'previous_token': 'IOONL4H953QEEZZZ'}
{'result_count': 1000, 'next_token': '78KAM7USOG5HGZZZ', 'previous_token': 'U63U4CTI63QEEZZZ'}
{'result_count': 1000, 'next_token': 'TIBA701INO5H

Rate limit exceeded. Sleeping for 892 seconds.


{'result_count': 1000, 'next_token': '33HI91LGLK5HGZZZ', 'previous_token': '83S02N9SA3QEEZZZ'}
{'result_count': 1000, 'next_token': 'N0EQRT75LC5HGZZZ', 'previous_token': '5B1ABBAFABQEEZZZ'}
{'result_count': 1000, 'next_token': 'A11MJJRVL85HGZZZ', 'previous_token': 'R5NI0CGQAJQEEZZZ'}
{'result_count': 1000, 'next_token': 'LIOTSQADL45HGZZZ', 'previous_token': 'PL7I6GS0ANQEEZZZ'}
{'result_count': 1000, 'next_token': 'OTO4OQQ8L05HGZZZ', 'previous_token': 'IA0BN5TIARQEEZZZ'}
{'result_count': 1000, 'next_token': 'HLM3HFQBKS5HGZZZ', 'previous_token': 'BIGLR7DNAVQEEZZZ'}
{'result_count': 1000, 'next_token': 'F863FFIFKO5HGZZZ', 'previous_token': '6RNTL05LB3QEEZZZ'}
{'result_count': 1000, 'next_token': 'C92G3B9KKK5HGZZZ', 'previous_token': '6Q8B125GB7QEEZZZ'}
{'result_count': 1000, 'next_token': '9FRGCRMQMG51GZZZ', 'previous_token': 'RRFCBLEBBBQEEZZZ'}
{'result_count': 1000, 'next_token': '0ECQ6H4N9S4HGZZZ', 'previous_token': 'P6D57TEL9FQUEZZZ'}
{'result_count': 999, 'next_token': 'MDCB69NHOC3HG

Rate limit exceeded. Sleeping for 893 seconds.


{'result_count': 1000, 'next_token': 'PUBUK6HEGBVHEZZZ', 'previous_token': 'B4U0JKAERVVEEZZZ'}
{'result_count': 1000, 'next_token': 'MKTDRKHF53V1EZZZ', 'previous_token': 'NO6FRQOUFO0EGZZZ'}
{'result_count': 1000, 'next_token': '0NF8AJ03GRUHEZZZ', 'previous_token': '3PSN43ELQS0UGZZZ'}
{'result_count': 1000, 'next_token': 'P11NU6ITFRU1EZZZ', 'previous_token': 'AIHPM5G7F81EGZZZ'}
{'result_count': 1000, 'next_token': 'QE2STF577BTHEZZZ', 'previous_token': '46OJIO5IG41UGZZZ'}
{'result_count': 1000, 'next_token': 'E49VFL2VTNSHEZZZ', 'previous_token': 'H01RHGSKOK2EGZZZ'}
{'result_count': 1000, 'next_token': '78JK4DG77VSHEZZZ', 'previous_token': 'N9STDUD6283EGZZZ'}
{'result_count': 1000, 'next_token': 'C4E5OUK6EVS1EZZZ', 'previous_token': 'R6JODGUEO43EGZZZ'}
{'result_count': 999, 'next_token': 'SD959VO0CJRHEZZZ', 'previous_token': 'S1K413AUH43UGZZZ'}
{'result_count': 1000, 'next_token': 'ICL1S9TRVBR1EZZZ', 'previous_token': '36T7AND0JG4EGZZZ'}
{'result_count': 1000, 'next_token': '8C12F00IDVQHE

Rate limit exceeded. Sleeping for 892 seconds.


{'result_count': 1000, 'next_token': 'BCGUKT8CPJM1EZZZ', 'previous_token': '53G4PT54GG8UGZZZ'}
{'result_count': 1000, 'next_token': 'CA9DH0215VLHEZZZ', 'previous_token': 'GTF60OF16G9UGZZZ'}
{'result_count': 1000, 'next_token': 'T9PVP69SIBKHEZZZ', 'previous_token': 'H3EGE46HQ0AEGZZZ'}
{'result_count': 1000, 'next_token': 'JHOTTQ2D6BK1EZZZ', 'previous_token': 'CBM06RF8DKBEGZZZ'}
{'result_count': 1000, 'next_token': '5II1A7KKENJHEZZZ', 'previous_token': 'S43LHVVBPKBUGZZZ'}
Completed pulling follower data for MCRofficial.
Time taken was 1:45:08.517333.


In [17]:
# checking dataframes to ensure proper data pull
users_output_MCR.head()

Unnamed: 0,id,username,name,location,follower_count,following,description
0,1132811754176765952,nutman71234668,quaintqueef420,,26,509,i smelly
1,1570075902817583106,DemoLoversMCR,Demolition Lovers Gang - MCR,"Newark, New Jersey",1,6,Official petition. We want to hear Demoition L...
2,1537290582556454914,Meooowcy,CréamyLatté,"Lungsod ng Valenzuela, Pambans",2,65,
3,1232241829191593984,jamieexisted,jamie,"he/him, 19",33,310,🏳️‍🌈🏳️‍⚧️
4,1570087687683555328,KarmenWeaks,Karmen Weaks,,0,89,


### Artist 2: Missy Elliott @MissyElliott

In [18]:
# note the artists' twitter handles
handles = ['MissyElliott']
handle_followers = []

whitespace_pattern = re.compile(r"\s+")

user_data = dict() 
followers_data = dict()

for handle in handles :
    user_obj = client.get_user(username=handle) # this will specify the handles
    user_data[handle] = [] # will be a list of lists
    followers_data[handle] = [] # will be a simple list of IDs


# Grabs the time when we start making requests to the API
start_time = datetime.datetime.now()

for handle in handles :
    
    # Create the output file names 
    
    followers_output_file = handle + "_followers.txt"
    user_data_output_file = handle + "_follower_data.txt"
    
    # this produces a print statement indicating artist
    print(f'Pulling followers for {handle}.')
    
    # Using tweepy.Paginator (https://docs.tweepy.org/en/latest/v2_pagination.html), 
    # use `get_users_followers` to pull the follower data requested. 
    
    for followers in tweepy.Paginator(client.get_users_followers,
            user_obj.data.id,
            user_fields=["username","name","id","location","public_metrics","description"],
            max_results=1000,
            limit=100):
        print(followers.meta)
        
        for follower in followers.data:
            follower_row = {'id': follower.id,
                            'username': follower.username,
                            'name': follower.name,
                            'location': follower.location,
                            'follower_count': follower.public_metrics['followers_count'],
                            'following': follower.public_metrics['following_count'],
                            'description': follower.description
                           }
            user_data[handle].append(follower_row) # this appends all follower data
            followers_data[handle].append(follower.id) # this appends only follower ids
            
    # For each response object, extract the needed fields and store them in a dictionary or
    # data frame. 
    users_output_missy = pd.DataFrame(user_data[handle], columns=['id','username',
                                                               'name', 'location',
                                                               'follower_count', 'following',
                                                               'description'])
    followers_output_missy = pd.DataFrame(user_data[handle], columns=['id'])
    
    users_output_missy.to_csv(f'twitter/{handle}_followers.txt', sep='\t', index=False)
    followers_output_missy.to_csv(f'twitter/{handle}_follower_data.txt', sep='\t', index=False)
            
    print(f'Completed pulling follower data for {handle}.')
        
# Let's see how long it took to grab all follower IDs
end_time = datetime.datetime.now()
print(f'Time taken was {end_time - start_time}.')


Pulling followers for MissyElliott.
{'result_count': 1000, 'next_token': 'PR0OT22HEGPHGZZZ'}
{'result_count': 1000, 'next_token': '2RLBKQHC24PHGZZZ', 'previous_token': 'VFN8GGDJHF6EEZZZ'}
{'result_count': 1000, 'next_token': 'B8IHKHNLJKP1GZZZ', 'previous_token': 'G73LE512TV6EEZZZ'}
{'result_count': 1000, 'next_token': '081HS2L164P1GZZZ', 'previous_token': 'L0UVGV8ECB6UEZZZ'}
{'result_count': 1000, 'next_token': 'P7OUIVTMS8OHGZZZ', 'previous_token': 'QG0RQLR3PR6UEZZZ'}
{'result_count': 1000, 'next_token': 'E219JO4LHCOHGZZZ', 'previous_token': 'H7O68KJ13N7EEZZZ'}
{'result_count': 1000, 'next_token': 'Q8O93CEIVSO1GZZZ', 'previous_token': 'PIMLKFNOEJ7EEZZZ'}
{'result_count': 1000, 'next_token': '3EDAKLVPK8O1GZZZ', 'previous_token': 'OP6BVH9E037UEZZZ'}
{'result_count': 1000, 'next_token': 'D9U2TA3J84O1GZZZ', 'previous_token': 'U39LAF0EBN7UEZZZ'}
{'result_count': 1000, 'next_token': 'F3DKAL9UOGNHGZZZ', 'previous_token': 'E5HF1TOFNV7UEZZZ'}


Rate limit exceeded. Sleeping for 892 seconds.


{'result_count': 1000, 'next_token': 'N352RRUECONHGZZZ', 'previous_token': '1VUO8D6O7F8EEZZZ'}
{'result_count': 1000, 'next_token': '2132R2I11GNHGZZZ', 'previous_token': 'VJUBEMI9J78EEZZZ'}
{'result_count': 1000, 'next_token': '36BD4PF8KCN1GZZZ', 'previous_token': '1EJSRTE4UF8EEZZZ'}
{'result_count': 1000, 'next_token': 'KH9KCNG9B0N1GZZZ', 'previous_token': 'AF9VITP2BJ8UEZZZ'}
{'result_count': 1000, 'next_token': '70A6A0FU2CN1GZZZ', 'previous_token': 'A2ENKF8CL38UEZZZ'}
{'result_count': 1000, 'next_token': '3R2OJLAVI8MHGZZZ', 'previous_token': 'SUD3838KTJ8UEZZZ'}
{'result_count': 999, 'next_token': 'SS133RHO80MHGZZZ', 'previous_token': '9TQB01FFDN9EEZZZ'}
{'result_count': 999, 'next_token': 'SIG9B7MB2KMHGZZZ', 'previous_token': '1DEQ65E9NV9EEZZZ'}
{'result_count': 1000, 'next_token': 'C6FPHE6SN8M1GZZZ', 'previous_token': 'UFHRI41STB9EEZZZ'}
{'result_count': 999, 'next_token': '0KO77Q9J58M1GZZZ', 'previous_token': 'DHBDISPA8N9UEZZZ'}
{'result_count': 1000, 'next_token': '957OTJSJICLHGZZ

Rate limit exceeded. Sleeping for 893 seconds.


{'result_count': 999, 'next_token': 'QTBOC4N8L4KHGZZZ', 'previous_token': '47S4V025R7AUEZZZ'}
{'result_count': 1000, 'next_token': 'OF66I7FG3KKHGZZZ', 'previous_token': 'N55PVR16ARBEEZZZ'}
{'result_count': 1000, 'next_token': 'T2O9QCNPFOK1GZZZ', 'previous_token': '18F7Q5HBSBBEEZZZ'}
{'result_count': 1000, 'next_token': '339PJV0I1CK1GZZZ', 'previous_token': 'CNR2CK9JG7BUEZZZ'}
{'result_count': 1000, 'next_token': 'GS0B7K21KSJHGZZZ', 'previous_token': 'QTKRMUVOUJBUEZZZ'}
{'result_count': 999, 'next_token': 'V8G9MKI818JHGZZZ', 'previous_token': 'U8S8PPH5B7CEEZZZ'}
{'result_count': 1000, 'next_token': '3IF8V5L6COJ1GZZZ', 'previous_token': 'C4SV62M2UNCEEZZZ'}
{'result_count': 1000, 'next_token': '7CQGGHTOOSIHGZZZ', 'previous_token': 'G2QDSUIRJ7CUEZZZ'}
{'result_count': 1000, 'next_token': '4CIMF5RS5OIHGZZZ', 'previous_token': '6ANKLH2873DEEZZZ'}
{'result_count': 1000, 'next_token': 'HPRK7P0VL8I1GZZZ', 'previous_token': 'FK824E57Q7DEEZZZ'}
{'result_count': 999, 'next_token': 'UJERKO8R1KI1GZZ

Rate limit exceeded. Sleeping for 893 seconds.


{'result_count': 1000, 'next_token': 'PF11D68J2GGHGZZZ', 'previous_token': 'JCUF2I3887FEEZZZ'}
{'result_count': 1000, 'next_token': 'BIAA9E56GSG1GZZZ', 'previous_token': 'C1UTT9G4TJFEEZZZ'}
{'result_count': 1000, 'next_token': 'RM1JIF9Q6OG1GZZZ', 'previous_token': '3F6IBHR2F3FUEZZZ'}
{'result_count': 1000, 'next_token': '8L25BQGAJSFHGZZZ', 'previous_token': 'RAB9O2U9P7FUEZZZ'}
{'result_count': 1000, 'next_token': '8QE0EQ5A64FHGZZZ', 'previous_token': 'V6T6QT8GC7GEEZZZ'}
{'result_count': 1000, 'next_token': 'N8N9O2HBRGF1GZZZ', 'previous_token': '49K3NQJFPRGEEZZZ'}
{'result_count': 1000, 'next_token': 'NKNU7U3OD4F1GZZZ', 'previous_token': 'PDOJA1ES4FGUEZZZ'}
{'result_count': 999, 'next_token': 'V9DMDHQ70GF1GZZZ', 'previous_token': '2QR2A2L8IRGUEZZZ'}
{'result_count': 999, 'next_token': 'V8M3J8DLLKEHGZZZ', 'previous_token': 'DHSDGK66VFGUEZZZ'}
{'result_count': 1000, 'next_token': '8D098PBKG4EHGZZZ', 'previous_token': '9CHLURIEABHEEZZZ'}
{'result_count': 1000, 'next_token': 'K7NV30SC60EHGZ

Rate limit exceeded. Sleeping for 893 seconds.


{'result_count': 1000, 'next_token': 'VDMNPIKHP0D1GZZZ', 'previous_token': 'P8NILL9JJNIEEZZZ'}
{'result_count': 1000, 'next_token': '91132BV244D1GZZZ', 'previous_token': 'C4MGTMD86VIUEZZZ'}
{'result_count': 1000, 'next_token': 'EQJ1GMQNH8CHGZZZ', 'previous_token': '48V2L617RRIUEZZZ'}
{'result_count': 1000, 'next_token': 'USGD4PJ01SCHGZZZ', 'previous_token': 'T261F9LIENJEEZZZ'}
{'result_count': 1000, 'next_token': 'V8BA3TOCBGC1GZZZ', 'previous_token': 'DIR9UAT6U3JEEZZZ'}
{'result_count': 1000, 'next_token': 'D0E4U1DASGBHGZZZ', 'previous_token': 'OFMA9SGAKJJUEZZZ'}
{'result_count': 1000, 'next_token': 'CGJCJ6BRF8BHGZZZ', 'previous_token': 'RO8N2OIQ3FKEEZZZ'}
{'result_count': 1000, 'next_token': '0FSMKBGDLKB1GZZZ', 'previous_token': 'MSLKVSV0GNKEEZZZ'}
{'result_count': 1000, 'next_token': 'GPNO8KAC00B1GZZZ', 'previous_token': 'MATAA8M4AFKUEZZZ'}
{'result_count': 1000, 'next_token': '64UJGRVQ4KAHGZZZ', 'previous_token': '61FR1ON7VVKUEZZZ'}
{'result_count': 1000, 'next_token': '89PH7APJJ8A1

Rate limit exceeded. Sleeping for 893 seconds.


{'result_count': 1000, 'next_token': 'UOAP3CSO9491GZZZ', 'previous_token': 'CBCJUP4G0VMUEZZZ'}
{'result_count': 1000, 'next_token': 'LO2QSH7MMK8HGZZZ', 'previous_token': 'OILCP5JGMRMUEZZZ'}
{'result_count': 1000, 'next_token': 'S12U68E6PO81GZZZ', 'previous_token': '4EJBTR179BNEEZZZ'}
{'result_count': 1000, 'next_token': 'KN8273PE6O81GZZZ', 'previous_token': 'IIT9B4IM67NUEZZZ'}
{'result_count': 1000, 'next_token': 'RHHOOFUTI47HGZZZ', 'previous_token': 'OC958O00PBNUEZZZ'}
{'result_count': 1000, 'next_token': 'B75A7LP4Q471GZZZ', 'previous_token': 'S2UEU72CDROEEZZZ'}
{'result_count': 1000, 'next_token': 'ODMD2EHG3S71GZZZ', 'previous_token': 'JVN65L0D5VOUEZZZ'}
{'result_count': 1000, 'next_token': 'KSDA33939K6HGZZZ', 'previous_token': 'L9P1UTMOS3OUEZZZ'}
{'result_count': 1000, 'next_token': '3EPLP5T4NG61GZZZ', 'previous_token': 'MK6R0INBMBPEEZZZ'}
{'result_count': 1000, 'next_token': '5K8P5IOP7C61GZZZ', 'previous_token': '5FNFV7LI8FPUEZZZ'}
{'result_count': 1000, 'next_token': '7MUBRIF5K05H

Rate limit exceeded. Sleeping for 893 seconds.


{'result_count': 1000, 'next_token': 'UCE4PJIGRO41GZZZ', 'previous_token': 'U8K7G9U0L7REEZZZ'}
{'result_count': 1000, 'next_token': 'QB9MSOIS9K41GZZZ', 'previous_token': 'SS639UM947RUEZZZ'}
{'result_count': 1000, 'next_token': 'P130RSPHP03HGZZZ', 'previous_token': 'MIMVEUTLMBRUEZZZ'}
{'result_count': 1000, 'next_token': '4S93O60V903HGZZZ', 'previous_token': '9A544PV06VSEEZZZ'}
{'result_count': 1000, 'next_token': '72CS6HEJ083HGZZZ', 'previous_token': '9HB3SOFTMVSEEZZZ'}
{'result_count': 1000, 'next_token': '3DS6E4COKG31GZZZ', 'previous_token': '5OAOVDIAVNSEEZZZ'}
{'result_count': 1000, 'next_token': 'B2E510U98431GZZZ', 'previous_token': '0CSEIG50BFSUEZZZ'}
{'result_count': 1000, 'next_token': 'D1UAVQBNNS2HGZZZ', 'previous_token': '0T6FR6AINRSUEZZZ'}
{'result_count': 1000, 'next_token': 'EUD54KII7K2HGZZZ', 'previous_token': '8FM5EVTV83TEEZZZ'}
{'result_count': 1000, 'next_token': 'LSA4UCI3RO21GZZZ', 'previous_token': '0NHGTODOOBTEEZZZ'}
{'result_count': 1000, 'next_token': 'C3B520HNIK21

In [19]:
# checking dataframes to ensure proper data pull
users_output_missy.head()

Unnamed: 0,id,username,name,location,follower_count,following,description
0,1570115372401008642,SantSal0902,Santi,,0,74,
1,1259826247900770305,thePeaceFr0g,The_Peace Frog🐸,Pretoria Gauteng South Africa,204,1221,°|Motivational Speaker|°🙏\nGraphics|🐰designer|...
2,1386438624,campos_rosalina,Rixxgirl,"Porterville, California",16,114,"Love zombies, old horror flicks, some dramas a..."
3,1569314420135706624,BrownSkinn_DD,Destyne Lewis,,1,66,
4,1197427621891629056,monkeybearcares,MONKEYBEARCARES,,71,431,


In [20]:
tricky_description = """
    Home by Warsan Shire
    
    no one leaves home unless
    home is the mouth of a shark.
    you only run for the border
    when you see the whole city
    running as well.

"""
# This won't work in a tab-delimited text file.

clean_description = re.sub(r"\s+"," ",tricky_description).strip()
clean_description

'Home by Warsan Shire no one leaves home unless home is the mouth of a shark. you only run for the border when you see the whole city running as well.'

# Lyrics Scrape

This section asks you to pull data from the Twitter API and scrape www.AZLyrics.com. In the notebooks where you do that work you are asked to store the data in specific ways. 

In [34]:
artists = {'mychemicalromance':"https://www.azlyrics.com/m/mychemicalromance.html",
           'missy':"https://www.azlyrics.com/m/missy.html"} 
# we'll use this dictionary to hold both the artist name and the link on AZlyrics

## A Note on Rate Limiting

The lyrics site, www.azlyrics.com, does not have an explicit maximum on number of requests in any one time, but in our testing it appears that too many requests in too short a time will cause the site to stop returning lyrics pages. (Entertainingly, the page that gets returned seems to only have the song title to [a Tom Jones song](https://www.azlyrics.com/lyrics/tomjones/itsnotunusual.html).) 

Whenever you call `requests.get` to retrieve a page, put a `time.sleep(5 + 10*random.random())` on the next line. This will help you not to get blocked. If you _do_ get blocked, which you can identify if the returned pages are not correct, just request a lyrics page through your browser. You'll be asked to perform a CAPTCHA and then your requests should start working again. 

## Part 1: Finding Links to Songs Lyrics

That general artist page has a list of all songs for that artist with links to the individual song pages. 

Q: Take a look at the `robots.txt` page on www.azlyrics.com. (You can read more about these pages [here](https://developers.google.com/search/docs/advanced/robots/intro).) Is the scraping we are about to do allowed or disallowed by this page? How do you know? 

A: Taking a look at the page www.azlyrics.com/robots.txt, it appears that we are allowed to scrape the lyrics. The only limitations are pages with the following paths:

/lyricsdb/

/song/

Whether intentional or not, it appears that the path /song/ directs us to Rick Astley's "Never Gonna Give You Up".


In [35]:
# Let's set up a dictionary of lists to hold our links
lyrics_pages = defaultdict(list)

# see https://www.geeksforgeeks.org/extract-all-the-urls-from-the-webpage-using-python/ for additional info

# we've already specified our "url" in artists
for artist, artist_page in artists.items() :
    # request the page and sleep
    r = requests.get(artist_page)
    time.sleep(5 + 10*random.random())

    # now extract the links to lyrics pages from this page
    urls = []
    soup = BeautifulSoup(r.text, 'html.parser')
    for link in soup.find_all('a'):
        if ("lyrics/" + artist) in str(link.get('href')):
            urls.append('https://www.azlyrics.com/' + str(link.get('href')))
    # store the links `lyrics_pages` where the key is the artist and the
    # value is a list of links. 
    lyrics_pages[artist] = urls
    

In [80]:
lyrics_pages['mychemicalromance']

['https://www.azlyrics.com//lyrics/mychemicalromance/honeythismirrorisntbigenoughforthetwoofus.html',
 'https://www.azlyrics.com//lyrics/mychemicalromance/vampireswillneverhurtyou.html',
 'https://www.azlyrics.com//lyrics/mychemicalromance/drowninglessons.html',
 'https://www.azlyrics.com//lyrics/mychemicalromance/ourladyofsorrows.html',
 'https://www.azlyrics.com//lyrics/mychemicalromance/headfirstforhalos.html',
 'https://www.azlyrics.com//lyrics/mychemicalromance/skylinesandturnstiles.html',
 'https://www.azlyrics.com//lyrics/mychemicalromance/earlysunsetsovermonroeville.html',
 'https://www.azlyrics.com//lyrics/mychemicalromance/thisisthebestdayever.html',
 'https://www.azlyrics.com//lyrics/mychemicalromance/cubicles.html',
 'https://www.azlyrics.com//lyrics/mychemicalromance/demolitionlovers.html',
 'https://www.azlyrics.com//lyrics/mychemicalromance/helena.html',
 'https://www.azlyrics.com//lyrics/mychemicalromance/giveemhellkid.html',
 'https://www.azlyrics.com//lyrics/mychemica

Let's make sure we have enough lyrics pages to scrape. 

In [38]:
for artist, lp in lyrics_pages.items() :
    assert(len(set(lp)) > 20) 

In [57]:
# Let's see how long it's going to take to pull these lyrics 
# if we're waiting `5 + 10*random.random()` seconds 
for artist, links in lyrics_pages.items() : 
    print(f"For {artist} we have {len(links)} songs.")
    print(f"The full pull for this artist will take {round(len(links)*10/3600,2)} hours.")

For mychemicalromance we have 104 songs.
The full pull for this artist will take 0.29 hours.
For missy we have 125 songs.
The full pull for this artist will take 0.35 hours.


## Part 2: Pulling Lyrics

Now that we have the links to our lyrics pages, let's go scrape them! Here are the steps for this part. 

1. Create an empty folder in our repo called "lyrics". 
1. Iterate over the artists in `lyrics_pages`. 
1. Create a subfolder in lyrics with the artist's name. For instance, if the artist was Cher you'd have `lyrics/cher/` in your repo.
1. Iterate over the pages. 
1. Request the page and extract the lyrics from the returned HTML file using BeautifulSoup.
1. Use the function below, `generate_filename_from_url`, to create a filename based on the lyrics page, then write the lyrics to a text file with that name. 


In [86]:
def generate_filename_from_link(link) :
    
    if not link :
        return None
    
    # drop the http or https and the html
    name = link.replace("https","").replace("http","")
    name = link.replace(".html","")

    name = name.replace("/lyrics/","")
    name = name.replace("https://www.azlyrics.com/", "")
    
    # Replace useless chareacters with UNDERSCORE
    name = name.replace("://","").replace(".","_").replace("/","_")
    
    # tack on .txt
    name = name + ".txt"
    
    return(name)


In [87]:
# Make the lyrics folder here. If you'd like to practice your programming, add functionality 
# that checks to see if the folder exists. If it does, then use shutil.rmtree to remove it and create a new one.

if not os.path.isdir("lyrics") : 
    shutil.rmtree("lyrics/")

    os.mkdir("lyrics")

In [89]:
url_stub = "https://www.azlyrics.com" 
start = time.time()

total_pages = 0 

for artist in lyrics_pages :

    # Use this space to carry out the following steps: 
    
    # 1. Build a subfolder for the artist
    
    # follow example for creating lyrics folder
    if os.path.isdir("lyrics/" + artist):
        shutil.rmtree("lyrics/" + artist)
    
    os.mkdir("lyrics/" + artist)
    path = "lyrics/" + artist
        
    # 2. Iterate over the lyrics pages
    for song in lyrics_pages[artist]:
        
    # 3. Request the lyrics page. 
        # Don't forget to add a line like `time.sleep(5 + 10*random.random())`
        # to sleep after making the request
        lyric_page = requests.get(song)
        time.sleep(5+10*random.random())
        soup_lyrics = BeautifulSoup(lyric_page.content)
        
    # 4. Extract the title and lyrics from the page.
        song_name = soup_lyrics.find("b").get_text()
        song_lyrics = soup_lyrics.find("div", class_=False, id=False).get_text()
        
    # 5. Write out the title, two returns ('\n'), and the lyrics. Use `generate_filename_from_url`
        # to generate the filename. 
        the_song = song_name + '\n' + '\n' + song_lyrics
        to_folder = os.path.join(path, generate_filename_from_link(song))
        file = open(to_folder, 'w')
        file.write(the_song)
    
    # Remember to pull at least 20 songs per artist. It may be fun to pull all the songs for the artist
    

In [90]:
print(f"Total run time was {round((time.time() - start)/3600,2)} hours.")

Total run time was 0.67 hours.


---

# Evaluation

This assignment asks you to pull data from the Twitter API and scrape www.AZLyrics.com.  After you have finished the above sections , run all the cells in this notebook. Print this to PDF and submit it, per the instructions.

In [91]:
# Simple word extractor from Peter Norvig: https://norvig.com/spell-correct.html
def words(text): 
    return re.findall(r'\w+', text.lower())

---

## Checking Twitter Data

The output from your Twitter API pull should be two files per artist, stored in files with formats like `cher_followers.txt` (a list of all follower IDs you pulled) and `cher_followers_data.txt`. These files should be in a folder named `twitter` within the repository directory. This code summarizes the information at a high level to help the instructor evaluate your work. 

In [92]:
twitter_files = os.listdir("twitter")
twitter_files = [f for f in twitter_files if f != ".DS_Store"]
artist_handles = list(set([name.split("_")[0] for name in twitter_files]))

print(f"We see two artist handles: {artist_handles[0]} and {artist_handles[1]}.")

We see two artist handles: MCRofficial and MissyElliott.


In [95]:
# when pulling twitter follower data, I had switched the file names
# _followers contains all follower data
# _follower_data contains only the IDs
# will modify the evaluation code below to reflect this

for artist in artist_handles :
    follower_file = artist + "_follower_data.txt"
    follower_data_file = artist + "_followers.txt"
    
    ids = open("twitter/" + follower_file,'r').readlines()
    
    print(f"We see {len(ids)-1} in your follower file for {artist}, assuming a header row.")
    
    with open("twitter/" + follower_data_file,'r') as infile :
        
        # check the headers
        headers = infile.readline().split("\t")
        
        print(f"In the follower data file ({follower_data_file}) for {artist}, we have these columns:")
        print(" : ".join(headers))
        
        description_words = []
        locations = set()
        
        
        for idx, line in enumerate(infile.readlines()) :
            line = line.strip("\n").split("\t")
            
            try : 
                locations.add(line[3])            
                description_words.extend(words(line[6]))
            except :
                pass
    
        

        print(f"We have {idx+1} data rows for {artist} in the follower data file.")

        print(f"For {artist} we have {len(locations)} unique locations.")

        print(f"For {artist} we have {len(description_words)} words in the descriptions.")
        print("Here are the five most common words:")
        print(Counter(description_words).most_common(5))

        
        print("")
        print("-"*40)
        print("")
    

We see 99996 in your follower file for MCRofficial, assuming a header row.
In the follower data file (MCRofficial_followers.txt) for MCRofficial, we have these columns:
id : username : name : location : follower_count : following : description

We have 128407 data rows for MCRofficial in the follower data file.
For MCRofficial we have 26372 unique locations.
For MCRofficial we have 563428 words in the descriptions.
Here are the five most common words:
[('i', 14708), ('and', 9948), ('a', 8681), ('the', 7882), ('to', 5828)]

----------------------------------------

We see 99992 in your follower file for MissyElliott, assuming a header row.
In the follower data file (MissyElliott_followers.txt) for MissyElliott, we have these columns:
id : username : name : location : follower_count : following : description

We have 114991 data rows for MissyElliott in the follower data file.
For MissyElliott we have 14232 unique locations.
For MissyElliott we have 476022 words in the descriptions.
Here

Note that the number of data rows between follower and follower_data files for each respective artist does not match. I suspect I may have miscoded the twitter IDs, thus generating additional entries. 

## Checking Lyrics 

The output from your lyrics scrape should be stored in files located in this path from the directory:
`/lyrics/[Artist Name]/[filename from URL]`. This code summarizes the information at a high level to help the instructor evaluate your work. 

In [96]:
artist_folders = os.listdir("lyrics/")
artist_folders = [f for f in artist_folders if os.path.isdir("lyrics/" + f)]

for artist in artist_folders : 
    artist_files = os.listdir("lyrics/" + artist)
    artist_files = [f for f in artist_files if 'txt' in f or 'csv' in f or 'tsv' in f]

    print(f"For {artist} we have {len(artist_files)} files.")

    artist_words = []

    for f_name in artist_files : 
        with open("lyrics/" + artist + "/" + f_name) as infile : 
            artist_words.extend(words(infile.read()))

            
    print(f"For {artist} we have roughly {len(artist_words)} words, {len(set(artist_words))} are unique.")


For mychemicalromance we have 104 files.
For mychemicalromance we have roughly 33519 words, 2320 are unique.
For missy we have 125 files.
For missy we have roughly 59377 words, 5036 are unique.
