<div style="text-align:left;"><img src="images/elden_ring.jpeg" style="display:inline-block;"/></div>

# Elden Ring

Elden Ring is a video game released in February 2022 by FromSoftware Inc. The company is known for creating very difficult role-playing games such as Demon's Souls, Dark Souls 1-3, Bloodborne and Sekiro. Elden Ring follows this tradition of throwing bosses in your way that aim to humble you.

Using the Twitter API, I will extract tweets that mention Elden Ring in their text, store the data in MongoDB, clean it and analyze it. Famous streamers will without a doubt have mentioned Elden Ring in their tweets before, during and after the release of the game. Using the 'users' data, which connects the tweets to their posters, I will analyze some of what these streamers said.

## Summary, Overview

The database consists of two collections, one for the tweets and one for the users. Each user can write multiple tweets, and a tweet is always written by exactly one user. Both collections have a 'public_metrics' field, which contains counts such as likes and retweets for a tweet, or followers and total tweets for a user.
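
The link between the two collections can be illustrated in plain Python. This is only a minimal sketch: the helper name `attach_authors` and the sample documents are made up for illustration, and I am assuming that a tweet's 'author_id' matches the 'id' of a user document, as the fields returned by the API suggest.

```python
def attach_authors(tweets, users):
    """Return (username, text) pairs by looking up each tweet's author_id."""
    users_by_id = {u["id"]: u for u in users}
    pairs = []
    for t in tweets:
        author = users_by_id.get(t["author_id"])
        username = author["username"] if author else None
        pairs.append((username, t["text"]))
    return pairs


# Illustrative sample documents (not taken from the real collections)
sample_users = [{"id": "10", "username": "alice"}, {"id": "11", "username": "bob"}]
sample_tweets = [
    {"id": "100", "author_id": "10", "text": "Elden Ring tonight"},
    {"id": "101", "author_id": "10", "text": "Still stuck on the first boss"},
    {"id": "102", "author_id": "11", "text": "Elden Ring is out"},
]

for username, text in attach_authors(sample_tweets, sample_users):
    print(username, "->", text)
```

One user appears for two tweets here, which mirrors the one-to-many relationship described above.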

<img src="images/uml_diagram.png" >

The architecture of the project consists of an ETL process: we extract the data from the Twitter API in Python, clean it, and then load it back into our MongoDB collections. Finally, we analyze the data with several tools, mostly PyMongo.

<img src="images/architecture.png">

## Requirements & Configuration

In [1]:
import pymongo
from pprint import pprint
import pandas as pd
import requests
import json
import os
import time

import numpy as np
from dotenv import dotenv_values

The Twitter API is special in several ways. It is not possible to simply send a request to a URL, at least not to get the data I want. With standard access, Twitter only allows searching tweets from the last seven days. However, I have an academic account, which gives me full-archive access. Because Twitter returns at most 500 tweets per request, I have to call the API repeatedly until no more tweets matching my search criteria are returned. To make this possible, Twitter includes a token ('next_token') in each response, which allows me to continue where I left off. I will look at tweets from 13 February to 28 February 2022, a window around the game's release on 25 February 2022. I initially planned to look at tweets from the announcement date 11 June 2019 until 10 May 2022. However, the storage limit of 512 MB on the MongoDB database does not allow that many tweets to be saved.
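
The token-based paging described above can be sketched without touching the network. This is a simplified illustration, not the code used for the download: `iter_pages` and `FAKE_PAGES` are hypothetical names, and the fake pages only imitate the 'data' and 'meta' keys of a response.

```python
def iter_pages(fetch_page):
    """Yield response pages until one arrives without a next_token.

    fetch_page is a callable taking the token of the page to request
    (None for the first page) and returning the decoded JSON as a dict.
    """
    token = None
    while True:
        page = fetch_page(token)
        yield page
        token = page.get("meta", {}).get("next_token")
        if token is None:
            break


# Stand-in for the API: three tiny pages linked by tokens
FAKE_PAGES = {
    None: {"data": [{"id": "1"}, {"id": "2"}], "meta": {"next_token": "a"}},
    "a": {"data": [{"id": "3"}], "meta": {"next_token": "b"}},
    "b": {"data": [{"id": "4"}], "meta": {}},
}

tweets = []
for page in iter_pages(FAKE_PAGES.get):
    tweets.extend(page.get("data", []))

print([t["id"] for t in tweets])
```

The loop stops as soon as a page carries no 'next_token', which is how the last page of a search is recognized.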

To access the Twitter API as well as the MongoDB database, I need different credentials. It is good practice not to include passwords and usernames in the code. The dotenv library provides an easy way to read these credentials from outside this notebook. The following code block reads the credentials we have saved in a .env file and stores them in variables.

In [2]:
config = dotenv_values(".env")
USER = config['USER']
PASSWORD = config['PASSWORD']
BEARER_TOKEN = config['BEARER_TOKEN']
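
If a key is missing from the .env file, the cell above fails with a bare KeyError. A small check can give a clearer message. This is a sketch under assumptions: `missing_keys` is a hypothetical helper, and a plain dict stands in for the mapping returned by `dotenv_values(".env")`.

```python
REQUIRED_KEYS = ("USER", "PASSWORD", "BEARER_TOKEN")


def missing_keys(config, required=REQUIRED_KEYS):
    """Return the required keys that are absent or empty in the mapping."""
    return [key for key in required if not config.get(key)]


# A plain dict standing in for dotenv_values(".env")
example_config = {"USER": "someone", "PASSWORD": "", "BEARER_TOKEN": "abc"}

print(missing_keys(example_config))  # ['PASSWORD']
```

In the notebook, one could raise an error listing the missing keys before any connection is attempted.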

In [7]:
# API and Database details
API_URL = "https://api.twitter.com/2/tweets/search/all"
CNX_STR = "mongodb+srv://" + USER + ":" + PASSWORD + "@cluster0.tbqzv.mongodb.net"
DB_NAME = "elden_ring"
COLL_TWEETS = "twitter"
COLL_USERS = "users"
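
One caveat with concatenating the connection string: a username or password containing characters such as '@' or '/' has to be percent-encoded to be valid in a MongoDB URI. A minimal sketch using the standard library follows; `build_cnx_str`, the credentials and the host name are placeholders for illustration only.

```python
from urllib.parse import quote_plus


def build_cnx_str(user, password, host):
    """Build a mongodb+srv URI with percent-encoded credentials."""
    return "mongodb+srv://" + quote_plus(user) + ":" + quote_plus(password) + "@" + host


# Placeholder values; the real ones come from the .env file
print(build_cnx_str("demo_user", "p@ss/word", "cluster0.example.mongodb.net"))
# mongodb+srv://demo_user:p%40ss%2Fword@cluster0.example.mongodb.net
```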

In [9]:
# connection to MongoDB
client = pymongo.MongoClient(CNX_STR)
db = client[DB_NAME]
twitter = db[COLL_TWEETS]
users = db[COLL_USERS]

In [34]:
testcol = db['test']

## ETL

### Remove all existing documents -> Reset collection

The below cells delete the content of the existing collections. However, downloading the data from the Twitter API takes a good amount of time, so it is not recommended to do this without good reason. 

In [10]:
#twitter.drop()
twitter.count_documents({})

281843

In [53]:
# users.drop()
users.count_documents({})

141210

### Define Query Parameters

Below we define the query so that we only get the tweets we are interested in. We are looking for all tweets containing both the words 'elden' and 'ring'. Only tweets classified as English are returned; retweets and replies are excluded. The search window runs from 13 February to 28 February 2022, covering the days before and after the release of the game. The maximum number of tweets that can be returned per request is 500. In tweet_fields I list the information I want with each tweet in addition to the defaults. Because of the author_id expansion, each response also contains the users who wrote the tweets; these go into a second collection.

In [10]:
# define query parameters
# match tweets containing the words "elden" and "ring" that are classified
# as English, excluding retweets and replies
query = "elden ring lang:en -is:retweet -is:reply"
start_time = "2022-02-13T00:00:00.000Z"  # start of the search window (UTC)
end_time = "2022-02-28T23:59:59.000Z"  # end of the search window (UTC)
max_results = "500"  # maximum number of tweets per request
# https://developer.twitter.com/en/docs/twitter-api/data-dictionary/object-model/tweet
tweet_fields = "created_at,author_id,geo,in_reply_to_user_id,lang,public_metrics"
user_fields = "username,location,public_metrics"
expansions = "author_id"  # also return the user objects of the tweet authors

# put query parameters in a dict
query_params = {
    "query": query,
    "tweet.fields": tweet_fields,
    "user.fields": user_fields,
    "start_time": start_time,
    "end_time": end_time,
    "max_results": max_results,
    "expansions": expansions,
}

# the bearer token is sent with every request
headers = {"Authorization": "Bearer " + BEARER_TOKEN}
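
The start_time and end_time values are UTC timestamps (the trailing 'Z'). Instead of typing them by hand, they could be generated from datetime objects. This is a small sketch; `to_twitter_time` is a hypothetical helper name, not part of any library.

```python
from datetime import datetime, timezone


def to_twitter_time(dt):
    """Format an aware datetime as a UTC timestamp string like the ones above."""
    return dt.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%S.000Z")


start = datetime(2022, 2, 13, 0, 0, 0, tzinfo=timezone.utc)
end = datetime(2022, 2, 28, 23, 59, 59, tzinfo=timezone.utc)

print(to_twitter_time(start))  # 2022-02-13T00:00:00.000Z
print(to_twitter_time(end))    # 2022-02-28T23:59:59.000Z
```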

### Fetch data and insert into MongoDB

In [11]:
tweet = []
user = []
while True:
    # get results according to url and query
    response = requests.get(API_URL, headers=headers, params=query_params)
    if response.status_code != 200:
        raise Exception(response.status_code, response.text)

    json_response = response.json()

    # collect the page locally and write it into the MongoDB collections;
    # a page without 'data' contains no tweets, so there is nothing to insert
    if 'data' in json_response:
        page_users = json_response.get('includes', {}).get('users', [])
        tweet = tweet + json_response['data']
        user = user + page_users
        twitter.insert_many(json_response['data'])
        if page_users:
            users.insert_many(page_users)

    # check if more data is available; if yes, continue with the next page
    next_token = json_response.get('meta', {}).get('next_token')
    if next_token is None:
        # remove the token so the cell can be re-run from the first page
        query_params.pop('next_token', None)
        break
    query_params['next_token'] = next_token
    time.sleep(5)  # pause between requests
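
The loop above stops with an exception on any status code other than 200. Since the API answers with HTTP 429 when too many requests are sent, a retry with increasing waits may be more forgiving. The following is a sketch under assumptions, not the code used for the download: `get_with_retry` and `FakeResponse` are hypothetical names, and the delays are arbitrary.

```python
import time


def get_with_retry(send, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call send() until it returns a non-429 response or attempts run out.

    send is any zero-argument callable returning an object with a
    status_code attribute, for example a lambda wrapping requests.get.
    """
    for attempt in range(max_attempts):
        response = send()
        if response.status_code != 429:
            return response
        sleep(base_delay * 2 ** attempt)  # wait 1s, 2s, 4s, ...
    raise RuntimeError("still rate limited after %d attempts" % max_attempts)


# Demonstration with a stand-in for the HTTP call (no network needed)
class FakeResponse:
    def __init__(self, status_code):
        self.status_code = status_code


replies = iter([FakeResponse(429), FakeResponse(429), FakeResponse(200)])
waits = []  # record the requested waits instead of actually sleeping
result = get_with_retry(lambda: next(replies), sleep=waits.append)

print(result.status_code, waits)  # 200 [1.0, 2.0]
```

Passing the sleep function as a parameter keeps the demonstration fast and makes the waiting behavior easy to inspect.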

In [7]:
# count number of documents inserted
print(twitter.count_documents({}))
print(users.count_documents({}))


281843
275221


In [8]:
twitter.find_one()

{'_id': ObjectId('6280a3558c6d53ffec73c796'),
 'author_id': '1492217705348231169',
 'id': '1498447916612009986',
 'lang': 'en',
 'public_metrics': {'retweet_count': 0,
  'reply_count': 0,
  'like_count': 7,
  'quote_count': 0},
 'created_at': '2022-02-28T23:59:57.000Z',
 'text': 'Come see me get absolutely annihilated in Elden Ring tonight 🥰 https://t.co/XmvKPi1A5D'}

In [9]:
users.find_one()

{'_id': ObjectId('6280a3568c6d53ffec73c988'),
 'name': 'AtomicAshe',
 'id': '1492217705348231169',
 'username': 'atomic_ashe',
 'public_metrics': {'followers_count': 104,
  'following_count': 35,
  'tweet_count': 10,
  'listed_count': 0}}

### Data Cleaning

We have over 280,000 unique tweets in our collection. However, there are many duplicates in the users collection. The API returns each user only once per response, but it does so again for every request. Because of the limit of 500 tweets per request I had to send many requests, so a user who tweeted several times can appear in several responses. Luckily, every user is identified by an ID, which means we can remove all documents with a duplicate ID from the collection.
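
The idea of keeping one document per user ID can be shown in plain Python. This is only an illustrative sketch with made-up users; `dedupe_by_id` is a hypothetical helper, and the cleaning in this notebook is done with pandas instead.

```python
def dedupe_by_id(records):
    """Keep the first record seen for each 'id', preserving order."""
    seen = set()
    unique = []
    for record in records:
        if record["id"] not in seen:
            seen.add(record["id"])
            unique.append(record)
    return unique


# Two responses that both include the same user (sample data)
pages = [
    [{"id": "10", "username": "alice"}, {"id": "11", "username": "bob"}],
    [{"id": "10", "username": "alice"}, {"id": "12", "username": "carol"}],
]
all_users = [u for page in pages for u in page]

print(len(all_users), len(dedupe_by_id(all_users)))  # 4 3
```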

I also asked the API to return the field 'lang' (language). However, since I only requested tweets that are classified as English, that field is superfluous: the value is the same for every document. I can therefore safely remove it.

Seeing how long it took to download the data, I have exported the collections with the command below (in the terminal). That way, if something goes wrong and I delete information I need, I can restore the data from this initial checkpoint.

```
mongoexport --db=elden_ring --collection=twitter --type=json --out=tweets.json "mongodb+srv://iaschwen:*******@cluster0.tbqzv.mongodb.net"
```

I have done the same for the collection 'users' and saved it as users.json.
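
A similar checkpoint could also be written from Python. The sketch below is an assumption-laden alternative, not what was used here: `dump_json_lines` and `load_json_lines` are hypothetical helpers, the sample documents are made up, and `default=str` turns values such as ObjectId into plain strings, so the result is not identical to a mongoexport file.

```python
import json
import os
import tempfile


def dump_json_lines(documents, path):
    """Write one JSON object per line; non-JSON types are stored as strings."""
    count = 0
    with open(path, "w", encoding="utf-8") as handle:
        for doc in documents:
            handle.write(json.dumps(doc, default=str) + "\n")
            count += 1
    return count


def load_json_lines(path):
    """Read the documents back as a list of dicts."""
    with open(path, encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


# Sample documents standing in for the result of a collection query
docs = [{"id": "1", "text": "Elden Ring"}, {"id": "2", "text": "First boss down"}]
backup_path = os.path.join(tempfile.mkdtemp(), "tweets_backup.jsonl")

written = dump_json_lines(docs, backup_path)
restored = load_json_lines(backup_path)
print(written, restored == docs)  # 2 True
```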

In [38]:
# load both collections into pandas DataFrames
df_twitter = pd.DataFrame(list(twitter.find()))
df_users = pd.DataFrame(list(users.find()))



In [39]:
# drop column 'lang' and 'withheld'
df_twitter.drop(['lang', 'withheld'], axis=1, inplace=True)

# drop duplicates from table users
df_users = df_users.drop_duplicates(subset='id').reset_index(drop=True)
df_users.drop(['withheld'], axis=1, inplace=True)

# drop the existing collections and insert the clean pandas df back into them
users.drop()
twitter.drop()

db.users.insert_many(df_users.to_dict('records'))
db.twitter.insert_many(df_twitter.to_dict('records'))

<pymongo.results.InsertManyResult at 0x7f4524545820>

In [40]:
users.find_one()

{'_id': ObjectId('6280a3568c6d53ffec73c988'),
 'name': 'AtomicAshe',
 'id': '1492217705348231169',
 'username': 'atomic_ashe',
 'public_metrics': {'followers_count': 104,
  'following_count': 35,
  'tweet_count': 10,
  'listed_count': 0},
 'location': nan}

In [11]:
twitter.find_one()

{'_id': ObjectId('6280a3558c6d53ffec73c796'),
 'author_id': '1492217705348231169',
 'id': '1498447916612009986',
 'public_metrics': {'retweet_count': 0,
  'reply_count': 0,
  'like_count': 7,
  'quote_count': 0},
 'created_at': '2022-02-28T23:59:57.000Z',
 'text': 'Come see me get absolutely annihilated in Elden Ring tonight 🥰 https://t.co/XmvKPi1A5D',
 'geo': nan}
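
The documents above now contain 'geo': nan and 'location': nan. These values come from the round trip through pandas, which fills fields that are missing in a document with NaN. One way to avoid storing them would be to drop such fields before inserting. This is a sketch only: `drop_nan_fields` is a hypothetical helper and the record is sample data.

```python
import math


def drop_nan_fields(record):
    """Return a copy of the record without fields whose value is a float NaN."""
    return {
        key: value
        for key, value in record.items()
        if not (isinstance(value, float) and math.isnan(value))
    }


# Sample record shaped like a row from df_twitter.to_dict('records')
record = {
    "id": "1",
    "text": "Elden Ring",
    "geo": float("nan"),
    "public_metrics": {"like_count": 7},
}

print(drop_nan_fields(record))
# {'id': '1', 'text': 'Elden Ring', 'public_metrics': {'like_count': 7}}
```

Applied to each record before `insert_many`, this would leave documents without location data free of the field instead of holding a NaN placeholder.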

## Data analysis

### Categories

In [87]:
c = jokes.aggregate([
    {"$project": {"joke": 0}},
    {"$unwind": "$categories"},
    {"$group": {"_id": "$categories", "count": {"$sum": 1}}},
 ])

pd.DataFrame(c)

|   | _id | count |
|---|---|---|
| 0 | nerdy | 105 |


In [88]:
c = jokes.aggregate([
    {"$match": {"categories":  {"$in" : ["nerdy"]}}},
])

pd.DataFrame(c)

|   | _id | id | joke | categories |
|---|---|---|---|---|
| 0 | 612cfd7d9d461a7ec9a683d6 | 20 | The Chuck Norris military unit was not used in... | [nerdy] |
| 1 | 612cfd7d9d461a7ec9a683db | 26 | Chuck Norris is the only human being to displa... | [nerdy] |
| 2 | 612cfd7d9d461a7ec9a683e4 | 36 | Chuck Norris originally appeared in the &quot;... | [nerdy] |
| 3 | 612cfd7d9d461a7ec9a68402 | 69 | Scientists have estimated that the energy give... | [nerdy] |
| 4 | 612cfd7d9d461a7ec9a68549 | 412 | Chuck Norris knows the last digit of pi. | [nerdy] |
| ... | ... | ... | ... | ... |
| 100 | 612cfd7d9d461a7ec9a685d5 | 565 | Chuck Norris can make a class that is both abs... | [nerdy] |
| 101 | 612cfd7d9d461a7ec9a685d6 | 566 | Chuck Norris could use anything in java.util.*... | [nerdy] |
| 102 | 612cfd7d9d461a7ec9a685d7 | 567 | Code runs faster when Chuck Norris watches it. | [nerdy] |
| 103 | 612cfd7d9d461a7ec9a685d8 | 584 | Only Chuck Norris shuts down websites without ... | [nerdy] |


### Jokes

In [89]:
c = jokes.aggregate([
    {"$match": {"joke":  {"$regex" : "chuck",  "$options": ""}}},
])

pd.DataFrame(c)

|   | _id | id | joke | categories |
|---|---|---|---|---|
| 0 | 612cfd7d9d461a7ec9a68404 | 72 | How much wood would a woodchuck chuck if a woo... | [] |
| 1 | 612cfd7d9d461a7ec9a6856e | 456 | All browsers support the hex definitions #chuc... | [nerdy] |


In [90]:
c = jokes.aggregate([
    {"$project": {"joke": 1}},
    {"$match": {"joke":  {"$not": {"$regex" : "chuck",  "$options": "i"}}}},
])

pd.DataFrame(c)

|   | _id | joke |
|---|---|---|
| 0 | 612cfd7d9d461a7ec9a6843b | There is in fact an 'I' in Norris, but there i... |
| 1 | 612cfd7d9d461a7ec9a6843d | An anagram for Walker Texas Ranger is KARATE W... |
| 2 | 612cfd7d9d461a7ec9a6844d | Superman once watched an episode of Walker, Te... |
| 3 | 612cfd7d9d461a7ec9a68450 | Movie trivia: The movie &quot;Invasion U.S.A.&... |
| 4 | 612cfd7d9d461a7ec9a68457 | Once you go Norris, you are physically unable ... |
| 5 | 612cfd7d9d461a7ec9a6848d | Crime does not pay - unless you are an underta... |
| 6 | 612cfd7d9d461a7ec9a6854a | Those aren't credits that roll after Walker Te... |



## Conclusions

## Remarks

### Learnings

In [4]:
%%HTML
<style>
/* display:none  -> hide In/Out column */
/* display:block -> show In/Out column */
div.prompt {display:none}
</style>