## Twitter Text Data Gathering Report

#### By Calvin Muller

#### October 20, 2021

In this report, I use the Twitter API v2 to build a query that gathers tweets that may help answer a question I have about music release dates: does the day an album is released significantly affect how well it does and how many people talk about it? Albums traditionally release at midnight on Friday. However, some well-known artists like to break from the status quo and release albums at odd times (most notably of late, Kanye releasing his highly anticipated "Donda" at 9am on a Sunday). Does this lead to more talk about the album and more publicity? That is what I hope this data can help answer.

The query I developed gathers recent tweets that mention an album and the day that album is going to release. I've used a few common synonyms for "releasing" in the hope of getting a wide variety of tweets. My hope is that these tweets will show which albums people are talking about more, as well as how many albums go the untraditional route of releasing on less-common days.

In [1]:
import json
import pandas as pd
import requests
import urllib.parse  # import the submodule explicitly so urllib.parse.quote is always available

In [2]:
bearer_token = pd.read_csv('twitter_bearer_token.txt', header = 0)

In [3]:
header = {'Authorization' : 'Bearer {}'.format(bearer_token['Bearer_Token'].iloc[0])}

In the cells above, I load the bearer token that Twitter issued once it approved my use case for this project, and place it in an authorization header. The header tells Twitter that I am allowed to access this data at the 'Standard' level I have been approved for.
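As an alternative sketch, the token could be read from an environment variable instead of a file. The function name `build_auth_header` and the variable name `TWITTER_BEARER_TOKEN` are my own assumptions for illustration, not anything Twitter requires.

```python
import os


def build_auth_header(token: str) -> dict:
    """Return the Authorization header for a bearer token."""
    token = token.strip()  # drop stray whitespace, e.g. a trailing newline
    if not token:
        raise ValueError("bearer token is empty")
    return {"Authorization": "Bearer {}".format(token)}


# TWITTER_BEARER_TOKEN is a hypothetical environment variable name.
env_token = os.environ.get("TWITTER_BEARER_TOKEN", "")
example_header = build_auth_header(env_token) if env_token else None
```

Keeping the token out of the notebook and its working directory makes it harder to commit it to a public repository by accident.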

In [4]:
endpoint_url = 'https://api.twitter.com/2/tweets/search/recent'

In [5]:
query = urllib.parse.quote('(album (Friday OR Thursday OR Monday OR Tuesday OR Wednesday OR Saturday OR Sunday) (dropping OR release OR releasing OR (coming out)) lang:en -is:retweet)')

In [6]:
tweet_fields = 'public_metrics,created_at,author_id,conversation_id'

In [7]:
url = endpoint_url + '?query={}&max_results=100&tweet.fields={}&user.fields={}&expansions={}'.format(query, tweet_fields, 'username', 'author_id')

In the four cells above, I create my query. I start with the endpoint URL that Twitter provides in its documentation. Next, I write the query in readable form and pass it through 'urllib.parse.quote', which percent-encodes it (spaces, parentheses, and colons become codes such as %20) so that it can travel safely inside a URL. I then finalize the URL that I send to Twitter by asking for the specific tweet fields I want the data to show. I decided to add conversation_id as a field so that I can tell whether people are talking about the same albums in the same thread or whether these are unique tweets.
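The same URL can also be assembled from a dictionary of parameters, which encodes every value in one step. This is a minimal sketch using the standard library; the query is shortened for readability, and `example_params` and `example_url` are illustrative names.

```python
from urllib.parse import parse_qs, quote, urlencode, urlsplit

endpoint_url = "https://api.twitter.com/2/tweets/search/recent"

# Shortened version of the query, for illustration only.
example_params = {
    "query": "album (Friday OR Sunday) (dropping OR release) lang:en -is:retweet",
    "max_results": 100,
    "tweet.fields": "public_metrics,created_at,author_id,conversation_id",
    "user.fields": "username",
    "expansions": "author_id",
}

# quote_via=quote encodes spaces as %20, matching urllib.parse.quote.
example_url = endpoint_url + "?" + urlencode(example_params, quote_via=quote)
print(example_url)

# Decoding the URL again recovers the original readable query.
decoded = parse_qs(urlsplit(example_url).query)
print(decoded["query"][0])
```

Building from a dictionary avoids counting `{}` placeholders by hand when a field is added or removed.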

In [8]:
response_1 = requests.request("GET", url, headers = header)

In [9]:
response_1_dict = json.loads(response_1.text)

#### Sending the request


Above is the actual request that I send to Twitter, using the URL and header I have already defined. Twitter returns the results as JSON text, so I then parse that text into a Python dictionary with 'json.loads'. The raw text can be read by eye, but it cannot be indexed by key or loaded into a dataframe until it has been parsed.
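A request can fail (for example, when a rate limit is hit), so it may be worth checking the status code before parsing. The sketch below uses a hypothetical helper, `parse_response`, and a made-up response body so that it runs without contacting Twitter. The handling of a missing 'data' key reflects my understanding that a search with no matches returns only 'meta'.

```python
import json


def parse_response(status_code: int, body: str) -> dict:
    """Parse a JSON response body into a dict, failing loudly on errors."""
    if status_code != 200:
        raise RuntimeError(
            "request failed with status {}: {}".format(status_code, body)
        )
    payload = json.loads(body)
    # Assumption: an empty result set has no 'data' key, so default to a list.
    payload.setdefault("data", [])
    return payload


# Made-up body for illustration only, not real Twitter output.
sample_body = (
    '{"data": [{"id": "1", "text": "album out Friday"}],'
    ' "meta": {"result_count": 1}}'
)
parsed = parse_response(200, sample_body)
print(type(parsed).__name__, parsed["meta"]["result_count"])
```

With the real request, the call would look like `parse_response(response_1.status_code, response_1.text)`.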

In [10]:
user_info = pd.DataFrame(response_1_dict['includes']['users'])

In [11]:
my_df = pd.DataFrame(response_1_dict['data'])

In [12]:
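# Match each tweet to its author by ID; 'includes' lists each user once, so rows do not line up by position
user_cols = user_info.rename(columns = {'id' : 'author_id'})[['author_id', 'name', 'username']]
my_df = my_df.merge(user_cols, on = 'author_id', how = 'left')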

In [None]:
my_df

#### Adding columns to a dataframe

Above, I add the 'name' and 'username' columns to the default pandas dataframe built from the JSON that Twitter returned. To do this, I build a second dataframe from the 'includes' key, which holds the user data. Once both dataframes exist (the default one comes from the 'data' key), I add the 'name' and 'username' columns to the tweet dataframe. I repeat this process throughout the rest of the code and will refer back to this markdown cell.
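One thing to watch when adding these columns is that 'includes' lists each user only once, so an author with several tweets appears in fewer rows than the tweet dataframe has. A minimal sketch of pairing each tweet with its own author is to join on the author ID. The rows below are made up for illustration.

```python
import pandas as pd

# Made-up rows for illustration only; real responses carry more fields.
tweets = pd.DataFrame(
    {
        "id": ["t1", "t2", "t3"],
        "author_id": ["u2", "u1", "u2"],
        "text": ["first", "second", "third"],
    }
)
users = pd.DataFrame(
    {
        "id": ["u1", "u2"],
        "name": ["Ann", "Bob"],
        "username": ["ann", "bob"],
    }
)

# Rename the user 'id' so it does not collide with the tweet 'id'.
user_cols = users.rename(columns={"id": "author_id"})[
    ["author_id", "name", "username"]
]
# A left join keeps every tweet, in its original order.
merged = tweets.merge(user_cols, on="author_id", how="left")
print(merged[["id", "author_id", "username"]])
```

Here the user "u2" wrote two tweets, and both rows receive the same username after the join.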

In [14]:
url_2 = url + '&next_token={}'.format(response_1_dict['meta']['next_token'])

#### Pagination

Here I use a technique called pagination to get more results from my request. Twitter returns at most 100 results per request, but more are available through this process. Under the 'meta' key, the response includes a value called 'next_token'. When you add this token to the URL you have built, the request returns the next 'page' of up to 100 results. This process can be repeated, and I will do it one more time, but each new URL must be built from the original URL so that it carries only the newest 'next_token' and not the old one.
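The pagination steps can also be written as a loop. The sketch below is one possible shape, not the approach used in this notebook: `iter_pages` and `get_page` are hypothetical names, and the request itself is replaced by a stand-in with made-up data so the example runs offline.

```python
from typing import Callable, Iterator, Optional


def iter_pages(
    get_page: Callable[[Optional[str]], dict], max_pages: int = 3
) -> Iterator[dict]:
    """Yield response dicts, following meta.next_token until it runs out."""
    next_token = None
    for _ in range(max_pages):
        page = get_page(next_token)
        yield page
        next_token = page.get("meta", {}).get("next_token")
        if next_token is None:
            break  # no further pages available


# Stand-in for the real request (made-up data, keyed by the token sent).
fake_pages = {
    None: {"data": [{"id": "1"}], "meta": {"next_token": "abc"}},
    "abc": {"data": [{"id": "2"}], "meta": {}},
}
pages = list(iter_pages(lambda token: fake_pages[token]))
print(len(pages))
```

With the real API, `get_page` would send the base URL when the token is `None`, and the base URL plus `&next_token=...` otherwise, then return the parsed JSON.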

In [15]:
response_2 = requests.request("GET", url_2, headers = header)

In [16]:
response_2_dict = json.loads(response_2.text)

In [17]:
user_info_2 = pd.DataFrame(response_2_dict['includes']['users'])

In [18]:
my_df2 = pd.DataFrame(response_2_dict['data'])

In [19]:
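# Match each tweet to its author by ID rather than by row position
user_cols_2 = user_info_2.rename(columns = {'id' : 'author_id'})[['author_id', 'name', 'username']]
my_df2 = my_df2.merge(user_cols_2, on = 'author_id', how = 'left')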

In [None]:
my_df2

Above, I repeat the steps already explained under "Sending the request" and "Adding columns to a dataframe", using new variable names to mark this as the second dataframe.

In [21]:
url_3 = url + '&next_token={}'.format(response_2_dict['meta']['next_token'])

In [22]:
response_3 = requests.request("GET", url_3, headers = header)

In [23]:
response_3_dict = json.loads(response_3.text)

In [24]:
user_info_3 = pd.DataFrame(response_3_dict['includes']['users'])

In [25]:
my_df3 = pd.DataFrame(response_3_dict['data'])

In [26]:
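# Match each tweet to its author by ID rather than by row position
user_cols_3 = user_info_3.rename(columns = {'id' : 'author_id'})[['author_id', 'name', 'username']]
my_df3 = my_df3.merge(user_cols_3, on = 'author_id', how = 'left')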

In [None]:
my_df3

For the third time, I repeat the process: creating a unique URL with pagination, sending the request, and adding the additional columns that I want shown.

In [28]:
# DataFrame.append was removed in pandas 2.0; pd.concat gives the same result
int_df = pd.concat([my_df, my_df2])

In [29]:
# DataFrame.append was removed in pandas 2.0; pd.concat gives the same result
fin_df = pd.concat([int_df, my_df3])

In [None]:
fin_df

In the cells above, I append the three dataframes I have created into a final dataframe named 'fin_df'.

In [31]:
fin_df.to_csv("Twitter_album_data.csv")

Finally, I export this final dataframe to a CSV file that can be viewed in Excel.

### Final Reflection and Assessment

Looking at the data that I have gathered from Twitter, I think it mostly accomplishes what I set out to do. Judging by the text, these are the types of tweets I was looking for: tweets anticipating the release of a new album. Having the conversation ID would also be helpful in later steps of analysis. For example, if I were to take this into a Pivot Table in Excel, I could group the tweets by conversation ID to see which tweets are generating the most in-thread responses.
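The same grouping could be done in pandas before the data ever reaches Excel. This is a minimal sketch on made-up rows; with the real data, `sample` would be the final dataframe.

```python
import pandas as pd

# Made-up rows for illustration only.
sample = pd.DataFrame(
    {
        "id": ["1", "2", "3", "4"],
        "conversation_id": ["1", "1", "3", "1"],
    }
)

# Count how many gathered tweets fall in each conversation, largest first.
thread_sizes = (
    sample.groupby("conversation_id")
    .size()
    .sort_values(ascending=False)
    .rename("tweet_count")
)
print(thread_sizes)
```

Note that this counts only the tweets the query gathered, so a thread whose replies did not match the query would look smaller than it really is.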

However, there may still be some limitations in this data. For one, it doesn't take into account the sentiment of these conversations: how people feel about these albums. Also, because the query runs only once rather than on a continuous basis, the data is likely influenced by the day the query is run, which could bias the results.

An alternative approach that I may want to take in the future is to create a function for this process that would run every day, so that the data gathered would not be biased by the day of the query. I would also likely want to create a function that repeats the steps of creating the URL, sending the request, and adding columns to the dataframe. This would save many lines of code and make it easier to repeat the pagination process as many times as I want.

Editor's note: I have cleared the dataframe output in the final submission to save space when reading through the code, because GitHub does not condense dataframes the way Jupyter Notebook does.