# Twitter Text Report: Donda #

Are highly anticipated albums doomed to fail? One way to answer this question is to gather public sentiment about an album from tweets posted before and after its release. I will look at Kanye's album Donda as an example, as Kanye is a polarizing musician who is widely talked about, leading to high anticipation for his music and a plethora of tweets to work with.

We start by importing the proper libraries.

In [2]:
import pandas as pd
import requests
import json
import urllib

Read in the keys and create the request header.

In [3]:
keys = pd.read_csv('twittertest.txt', sep = '\t', header = None)

bToken = keys[0][2]

header = {'Authorization' : 'Bearer {}'.format(bToken)}

### Create Query and Url ###

The url I created in order to obtain relevant tweets includes the #donda OR #kanye hashtags. 

Originally, my preferred album to explore was Solar Power by Lorde, but this became a challenge when #solarpower yielded results that were not relevant and #solarpower AND #lorde did not yield enough results. I did not understand why this was the case: the same Solar Power query on twitter.com returned more than 30 results, yet only 30 were pulled into this report.

Due to this hiccup, I used another album I knew came out recently: Donda. Donda avoided the problem of my Solar Power query, as "Donda" is specific to that album while "solar power" also describes an unrelated concept. Since Donda was specific, I could do a query using OR, broadening the search and yielding enough results to complete the pagination portion of the report. I also decided to use OR instead of AND since people may say "Kanye's new album" without mentioning it by name. He is such a recognizable figure in music that "Kanye's new album" is sufficient to let everyone know which album you are talking about.

The artist behind Donda was also a reason to choose it for analysis. Kanye's career and lifestyle have many people sharing their opinions not only of his music, but of him in general. This means that I am sure to get solid and varied public sentiment about Donda, since Kanye generates conversations. A month before the album came out, NPR released the article __["From a Small House in a Big Stadium, Kanye Comes Up Empty Handed."](https://www.npr.org/2021/09/02/1033424732/from-a-small-house-in-a-big-stadium-kanye-comes-up-empty-handed)__ The fact that NPR wrote an article on this album before it was released shows it was highly anticipated and that there was public sentiment brewing before the release. With these two qualifiers met, I decided Donda would be the album I used to answer my question.

Obviously, the answer to this question would require many different data pulls from many different album releases, but Donda is the start. 

##### Data Structure #####

As for the fields to include in the data pulled, I chose to include the verified flag in order to see whether verified accounts, which are usually other celebrities, talk about Kanye's album differently than non-verified accounts. The other fields are split between tweet_fields and user_fields and include name, Twitter handle (username), time tweeted (created_at), and engagement (public_metrics). Tweet ID does not need to be requested, as it is returned automatically and serves as a primary key.

In [4]:
base_url = 'https://api.twitter.com/2/tweets/search/recent'

query = urllib.parse.quote('(#donda OR #kanye) lang:en')

tweet_fields = 'public_metrics,created_at,author_id,lang'
user_fields = 'name,username,verified'
expansions = 'author_id'

api_url = base_url + '?query={}&max_results=100&tweet.fields={}&expansions={}&user.fields={}'.format(query, tweet_fields, expansions, user_fields)
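As an alternative to the format string, the same URL can be assembled from a dictionary of parameters. The following is only a sketch of that idea using the standard library's `urlencode`; the `params` name is introduced here for illustration. Note that this version also percent-encodes the commas in the field lists, which decodes to the same values on the server side.

```python
from urllib.parse import urlencode, quote

base_url = 'https://api.twitter.com/2/tweets/search/recent'

# Same parameters as the cell above, kept in a dict instead of a format string.
params = {
    'query': '(#donda OR #kanye) lang:en',
    'max_results': 100,
    'tweet.fields': 'public_metrics,created_at,author_id,lang',
    'expansions': 'author_id',
    'user.fields': 'name,username,verified',
}

# quote_via=quote encodes spaces as %20 (the default, quote_plus, would use '+').
api_url = base_url + '?' + urlencode(params, quote_via=quote)
print(api_url)
```

Keeping the parameters in one dictionary makes it easier to change a single field, such as the query, without editing a long format string.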

### Get Tweets ###

The tweets are pulled based on the query created previously. Once I have them, I can put them into a DataFrame. I am able to do this by taking the raw response text (response.text) and parsing it as JSON, which is then converted to a DataFrame. The tweets in the JSON sit under a top-level key called "data", so we must specify that we want the information inside the data key when creating the DataFrame, or else we would not get one row per tweet.

In [5]:
response = requests.request("GET", api_url, headers = header)

response_dict = json.loads(response.text)
my_df = pd.DataFrame(response_dict['data'])
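Because the URL requests `expansions=author_id`, the API returns the matching user objects under a separate `includes` key rather than inside `data`, so the `verified` flag is not part of `my_df` as built above. The sketch below shows one possible way to join the two and to flatten the nested `public_metrics` dictionary. The payload is made up to mirror the shape of a real response, and the names `tweets`, `users`, and `tweets_with_users` are illustrative only.

```python
import pandas as pd

# Made-up payload with the same shape as a recent-search response.
response_dict = {
    'data': [
        {'id': '1', 'author_id': '10', 'text': 'first tweet',
         'public_metrics': {'retweet_count': 6, 'reply_count': 0,
                            'like_count': 2, 'quote_count': 0}},
        {'id': '2', 'author_id': '11', 'text': 'second tweet',
         'public_metrics': {'retweet_count': 0, 'reply_count': 1,
                            'like_count': 5, 'quote_count': 0}},
    ],
    'includes': {
        'users': [
            {'id': '10', 'name': 'User A', 'username': 'user_a', 'verified': False},
            {'id': '11', 'name': 'User B', 'username': 'user_b', 'verified': True},
        ]
    },
    'meta': {'result_count': 2},
}

# json_normalize expands the nested public_metrics dict into columns
# such as public_metrics.like_count.
tweets = pd.json_normalize(response_dict['data'])

# User objects live under includes -> users; rename id so it matches author_id.
users = pd.DataFrame(response_dict['includes']['users']).rename(columns={'id': 'author_id'})

# A left join keeps every tweet even if its author is missing from includes.
tweets_with_users = tweets.merge(users, on='author_id', how='left')
print(tweets_with_users[['id', 'username', 'verified', 'public_metrics.like_count']])
```

With the user columns attached to each tweet, verified and non-verified accounts can be compared directly.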

### Pagination ###

In [6]:
api_url_2 = api_url + '&next_token={}'.format(response_dict['meta']['next_token'])

response_2 = requests.request("GET", api_url_2, headers = header)

# DataFrame.append was removed in pandas 2.0; pd.concat is the replacement.
my_df = pd.concat([my_df, pd.DataFrame(json.loads(response_2.text)['data'])], ignore_index = True)

##### Final Pagination 300 #####

In [7]:
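# The token for the third page comes from the second response, not the first.
response_dict_2 = json.loads(response_2.text)
api_url_3 = api_url + '&next_token={}'.format(response_dict_2['meta']['next_token'])

response_3 = requests.request("GET", api_url_3, headers = header)

my_df = pd.concat([my_df, pd.DataFrame(json.loads(response_3.text)['data'])], ignore_index = True)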

### To CSV  and Small Visualization ###

The first cell writes the data to a CSV file on my desktop.

The second cell shows a snapshot of the first 20 tweets out of the 300 I collected on Kanye's Donda.

In [8]:
my_df.to_csv("twitter_donda_data.csv")

In [12]:
my_df.head(20)

| | id | author_id | text | lang | created_at | public_metrics |
|---|---|---|---|---|---|---|
| 0 | 1450901005902749708 | 1421889444051881984 | RT @SteHodl: Great entry for $DZERO right now ... | en | 2021-10-20T19:05:30.000Z | {'retweet_count': 6, 'reply_count': 0, 'like_c... |
| 1 | 1450899666422358027 | 54124958 | Hold up! #Wayaminute! Can anyone explain what'... | en | 2021-10-20T19:00:11.000Z | {'retweet_count': 0, 'reply_count': 0, 'like_c... |
| 2 | 1450899589725233153 | 1274417852959739904 | featuring @kanyewest \n\n#donda\n#kanyewest ht... | en | 2021-10-20T18:59:53.000Z | {'retweet_count': 0, 'reply_count': 0, 'like_c... |
| 3 | 1450897597082841093 | 1450810669733924865 | Vibing to “Violent Crimes” by #Kanye at work m... | en | 2021-10-20T18:51:58.000Z | {'retweet_count': 0, 'reply_count': 0, 'like_c... |
| 4 | 1450895780072546311 | 486057278 | RT @star106: Coming Up @onwithmario – @mariolo... | en | 2021-10-20T18:44:44.000Z | {'retweet_count': 1, 'reply_count': 0, 'like_c... |
| 5 | 1450894631470776326 | 1412733693911781377 | RT @prince_obiukwu: #freemaziNnadikunanow \n#f... | en | 2021-10-20T18:40:11.000Z | {'retweet_count': 6, 'reply_count': 0, 'like_c... |
| 6 | 1450894570854789124 | 304608661 | RT @niyahbabii91: I know yall HATE this man bu... | en | 2021-10-20T18:39:56.000Z | {'retweet_count': 3, 'reply_count': 0, 'like_c... |
| 7 | 1450894250040860682 | 175170901 | RT @SirCoreGant: Side by Side #WhiteFace #Kany... | en | 2021-10-20T18:38:40.000Z | {'retweet_count': 6, 'reply_count': 0, 'like_c... |
| 8 | 1450893741468917761 | 1256270849721741314 | @kanyewest spotted masked up in NYC wearing a ... | en | 2021-10-20T18:36:38.000Z | {'retweet_count': 0, 'reply_count': 0, 'like_c... |
| 9 | 1450893655888252931 | 1316656125509541888 | First thing that came to mind when I saw this ... | en | 2021-10-20T18:36:18.000Z | {'retweet_count': 0, 'reply_count': 0, 'like_c... |


### Reflection ###

##### Quality #####
The data collected can be used to answer my driving question, although it may not be the most reliable. The same qualities that make Donda a strong candidate for this kind of analysis are also the reasons the data may be problematic. Including #kanye captures tweets that are about the album but do not use the #donda hashtag, as well as sentiments about Kanye as a person that could influence how the album is perceived, but it also pulls in irrelevant tweets. Not every #kanye tweet will be about Donda, and this can skew the sentiment I am trying to collect. While a tweet talking about Kanye's previous music would be relevant, a tweet talking about Kanye's political affiliations would not. Also, this DataFrame has 300 tweets, which is not enough to represent the majority of people's opinions. Making a claim about the public as a whole would be irresponsible with a dataset of this size.
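One rough way to gauge how much of the data is off-topic is to flag tweets whose text mentions the album and to mark retweets, which repeat the same text. The sketch below is only an illustration: the tweet texts are made up, and `album_terms` is a hypothetical keyword list that would need tuning on real data.

```python
import pandas as pd

# A few made-up tweet texts standing in for the collected data.
sample = pd.DataFrame({'text': [
    'Listening to #Donda again, still great #kanye',
    'RT @someone: #Kanye spotted in NYC',
    'Vibing to an older #Kanye track at work',
]})

# Hypothetical keyword list; a real one would need tuning.
album_terms = ['donda']
pattern = '|'.join(album_terms)

# Case-insensitive match so #Donda, #donda and #DONDA are all caught.
sample['mentions_album'] = sample['text'].str.contains(pattern, case=False, regex=True)

# Retweets returned by the API start with "RT @".
sample['is_retweet'] = sample['text'].str.startswith('RT @')

print(sample)
```

A keyword match like this is crude, since a tweet can discuss the album without naming it, but it gives a first estimate of how many rows are clearly on topic.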

##### Next Steps #####
In order to improve the data collected for analysis, I should address the quality concerns by revising my query and including more data points. Adding different capitalizations of the Donda and Kanye hashtags, as well as tangentially related hashtags like music and previous album names, while combining them with Kanye's name using AND logic, may yield more specific results and weed out irrelevant data while still capturing sentiments surrounding the album. To address the size concerns, I would keep paginating until I had upwards of 5,000 to 10,000 tweets. These numbers may still be small. To know how many tweets will be sufficient, I will need to know how many tweets my new query yields. Another improvement to pagination would be to create a function that automatically gathers the next 100 tweets for a chosen number of iterations. This would replace the repeated cells currently used with a single, terser function, making the notebook shorter and less error-prone.
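A minimal sketch of such a pagination function follows. It assumes a hypothetical callable, `fetch_page`, that takes a `next_token` (or `None` for the first page) and returns the parsed JSON response as a dictionary; the name `collect_tweets` is also introduced here. A dictionary of fake pages stands in for the API so the example runs offline.

```python
import pandas as pd


def collect_tweets(fetch_page, max_pages):
    """Collect up to max_pages pages of tweets into one DataFrame.

    fetch_page is a hypothetical callable: it takes a next_token (or None
    for the first page) and returns the parsed JSON response as a dict.
    """
    frames = []
    next_token = None
    for _ in range(max_pages):
        page = fetch_page(next_token)
        frames.append(pd.DataFrame(page.get('data', [])))
        # The token for the following page comes from the page just fetched.
        next_token = page.get('meta', {}).get('next_token')
        if next_token is None:
            break  # no further pages
    if not frames:
        return pd.DataFrame()
    return pd.concat(frames, ignore_index=True)


# Stand-in for the API so the sketch runs offline: three fake pages.
fake_pages = {
    None: {'data': [{'id': '1'}, {'id': '2'}], 'meta': {'next_token': 'a'}},
    'a': {'data': [{'id': '3'}, {'id': '4'}], 'meta': {'next_token': 'b'}},
    'b': {'data': [{'id': '5'}], 'meta': {}},
}

all_tweets = collect_tweets(lambda token: fake_pages[token], max_pages=10)
print(len(all_tweets))
```

In the notebook, `fetch_page` could wrap the existing request cells: send the GET request to `api_url`, append `&next_token=...` when a token is given, and return the parsed JSON.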