# Introduction

In this exercise, you will create a MongoDB collection of tweets that carry the #england and #wales hashtags, query it, and plot tweet volume over time.



## Connect to the database

1. Go back to the cluster's top page
2. Click "Connect"
3. Select "Connect your application"
4. Select Python and version 3.6 or later, then tick "include full driver..."
5. Copy the connection string into the cell below and replace `<password>` with your actual password

In [None]:
mongo_uri = "mongodb+srv://..."
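If the password contains characters such as `@`, `:` or `/`, it has to be percent-encoded before it goes into the URI. The following is a minimal sketch of that step; the helper name, user name, password and cluster host are all hypothetical and do not come from this exercise.

```python
from urllib.parse import quote_plus


def build_mongo_uri(user, password, host):
    """Return a mongodb+srv URI with the credentials percent-encoded."""
    return f"mongodb+srv://{quote_plus(user)}:{quote_plus(password)}@{host}/"


# Hypothetical values for illustration only; use the ones Atlas shows you.
example_uri = build_mongo_uri("student", "p@ss:word/1", "cluster0.example.mongodb.net")
print(example_uri)
```

The string copied from the Atlas page may carry extra query parameters after the host; those can be kept as they are.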

## Whitelist the IP of Colab

1. Connect to Colab and run the following cell, which prints the runtime's public IP address

In [None]:
!curl ipecho.net/plain

2. Go back to MongoDB page
3. Select Network Access on the right
4. Add the IP address from step 1
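Before pasting the address, it can help to check that the `curl` output really is a single IP address and to see it in CIDR notation, where one IPv4 address corresponds to a `/32` network. A small sketch using only the standard library; the helper name and the sample address are made up for illustration.

```python
import ipaddress


def to_access_list_entry(raw):
    """Validate the printed address and return it in CIDR notation."""
    address = ipaddress.ip_address(raw.strip())  # raises ValueError if malformed
    return str(ipaddress.ip_network(str(address)))


# Sample address from the documentation range, not a real Colab IP.
print(to_access_list_entry(" 203.0.113.7\n"))
```

The Colab address may change when the runtime restarts, in which case the steps above need to be repeated.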

## Prepare the MongoDB connection

- First, install dnspython, which pymongo needs to resolve `mongodb+srv://` URIs

In [None]:
!pip install dnspython
import dns

In [None]:
import pymongo
from pymongo import MongoClient
import pandas as pd

## Open the database and create the collection

In [None]:
cluster = MongoClient(mongo_uri)
db = cluster['gv918-2022-lecture']

In [None]:
db.create_collection("england-wales")

In [None]:
collection = db['england-wales']
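Running `db.create_collection` a second time for a name that already exists raises an error in pymongo, which is easy to hit when re-running notebook cells. One possible guard is sketched below; the helper name is hypothetical, while `list_collection_names` and `create_collection` are the pymongo `Database` methods.

```python
def get_or_create_collection(db, name):
    """Create the collection only if it is missing, then return a handle to it."""
    if name not in db.list_collection_names():
        db.create_collection(name)
    return db[name]


# Usage in the notebook (needs the live connection opened above):
# collection = get_or_create_collection(db, "england-wales")
```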

## Populate the collection

In [None]:
!wget "https://www.dropbox.com/s/ufj9xk9llupod9h/tw_data.csv.tar.gz?dl=0" -O tw_data.csv.tar.gz

In [None]:
df_tweet = pd.read_csv('tw_data.csv.tar.gz')
df_tweet.head()

In [None]:
list_dic = df_tweet.to_dict(orient = "records")

In [None]:
collection.insert_many(list_dic)
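Empty CSV cells come out of `to_dict(orient="records")` as `float("nan")`, and they would be stored as NaN values rather than as missing fields. If that is not wanted, the records can be cleaned first. A minimal sketch, assuming a made-up record; the helper name and the `user_location` field are hypothetical.

```python
import math


def nan_to_none(record):
    """Replace float NaN values in one record with None."""
    return {
        key: (None if isinstance(value, float) and math.isnan(value) else value)
        for key, value in record.items()
    }


# Made-up record; "user_location" is a hypothetical column name.
sample = {
    "user_screen_name": "example_user",
    "text": "Come on #Wales",
    "user_location": float("nan"),
}
cleaned = nan_to_none(sample)
print(cleaned)

# In the notebook this would be applied before inserting:
# collection.insert_many([nan_to_none(r) for r in list_dic])
```

Note that running `insert_many` again adds the same tweets a second time, so the cell should only be run once per collection.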

- Count the number of tweets in the collection

In [None]:
collection.count_documents({})

## Let's look at the first several records

In [None]:
list(collection.find({}).limit(5))

## Run queries

Now let's run queries.

#### Documentation

- https://docs.mongodb.com/manual/tutorial/query-documents/
- https://docs.mongodb.com/manual/reference/operator/query/


### Find tweets with URLs

In [None]:
collection.count_documents({"text":{"$regex":"http"}})
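The pattern `"http"` matches any tweet whose text contains those four letters, not only links. A slightly stricter filter is sketched below and previewed locally with Python's `re` module on made-up sample texts; for a simple pattern like this one the local result should agree with the server-side match, but that is an assumption rather than a guarantee.

```python
import re

# Stricter filter: require the scheme separator after http or https
url_filter = {"text": {"$regex": r"https?://"}}

# Made-up sample texts for a local preview
samples = [
    "Kick-off soon https://t.co/abc123",
    "my httpd server is down",
    "no link here",
]
pattern = re.compile(url_filter["text"]["$regex"])
with_urls = [s for s in samples if pattern.search(s)]
print(with_urls)

# Against the live collection:
# collection.count_documents(url_filter)
```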

### Find tweets with the terms England and Wales

- Try the same with a case-insensitive match (use `$options`)
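One way to approach this exercise is to combine two `$regex` conditions with `$and`, reading "England and Wales" as "both terms appear in the text" (an interpretation, not something the exercise states). The helper names below are hypothetical, and `matches_locally` is only a rough stand-in for the server-side match so the filter can be tried without a database.

```python
import re


def both_terms_filter(case_insensitive=False):
    """Build a filter that requires both terms in the tweet text."""
    options = {"$options": "i"} if case_insensitive else {}
    return {"$and": [
        {"text": {"$regex": "England", **options}},
        {"text": {"$regex": "Wales", **options}},
    ]}


def matches_locally(text, query):
    """Rough local stand-in for the server-side match, for illustration only."""
    for condition in query["$and"]:
        spec = condition["text"]
        flags = re.IGNORECASE if "i" in spec.get("$options", "") else 0
        if not re.search(spec["$regex"], text, flags):
            return False
    return True


tweet = "england v wales tonight"  # made-up text
strict_hit = matches_locally(tweet, both_terms_filter())
relaxed_hit = matches_locally(tweet, both_terms_filter(case_insensitive=True))
print(strict_hit, relaxed_hit)

# Against the live collection:
# collection.count_documents(both_terms_filter(case_insensitive=True))
```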

### How many of them are retweets (i.e. tweets that start with "RT")?
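A possible filter for this exercise anchors the pattern at the start of the text with `^`. The sketch assumes that retweet text begins with the letters "RT" and that "them" refers to tweets mentioning both England and Wales; the sample texts are made up.

```python
import re

RETWEET_PREFIX = "^RT"  # assumption: retweet text begins with "RT"

retweets_with_both_terms = {"$and": [
    {"text": {"$regex": RETWEET_PREFIX}},
    {"text": {"$regex": "England", "$options": "i"}},
    {"text": {"$regex": "Wales", "$options": "i"}},
]}

# Local preview of the prefix check on made-up texts
samples = [
    "RT @example_user: England v Wales",
    "England v Wales, RT if you agree",
]
is_retweet = [bool(re.search(RETWEET_PREFIX, s)) for s in samples]
print(is_retweet)

# Against the live collection:
# collection.count_documents(retweets_with_both_terms)
```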


### Counting the number of tweets by user

- Count the number of tweets from each user. Who has the largest number?

In [None]:
list(collection.aggregate([{"$group": {"_id": "$user_screen_name", "count": {"$sum": 1}}}, {"$match": {"count": {"$gt": 100}}}]))
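To answer "who has the largest number" directly on the server, the pipeline can sort the grouped counts and keep the first row. The sketch below shows such a pipeline next to a local `Counter` version of the same computation on made-up records; the variable names and sample user names are illustrative.

```python
from collections import Counter

# Group by user, sort by count in descending order, keep the top row
top_user_pipeline = [
    {"$group": {"_id": "$user_screen_name", "count": {"$sum": 1}}},
    {"$sort": {"count": -1}},
    {"$limit": 1},
]

# The same idea on made-up records, without a database
sample = [
    {"user_screen_name": "user_a"},
    {"user_screen_name": "user_b"},
    {"user_screen_name": "user_b"},
]
counts = Counter(doc["user_screen_name"] for doc in sample)
top_user, top_count = counts.most_common(1)[0]
print(top_user, top_count)

# Against the live collection:
# list(collection.aggregate(top_user_pipeline))
```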

- Convert the results above to a DataFrame. Let's see who has the largest number of tweets


- Concatenate username with "https://twitter.com/" and visit user pages

- Too many spammers. Let's limit the results to accounts with more than 100 followers
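The three steps above could be sketched as follows. The rows are made up in the shape that the `$group` stage returns, and `user_followers_count` is a hypothetical field name that has to be replaced with whatever column holds the follower count in your data.

```python
import pandas as pd

# Made-up rows in the shape returned by the $group stage
rows = [
    {"_id": "user_a", "count": 120},
    {"_id": "user_b", "count": 340},
    {"_id": "user_c", "count": 101},
]

# Convert to a DataFrame and put the most active user first
df_users = (
    pd.DataFrame(rows)
    .rename(columns={"_id": "user_screen_name"})
    .sort_values("count", ascending=False)
    .reset_index(drop=True)
)

# Build a profile link for each user
df_users["url"] = "https://twitter.com/" + df_users["user_screen_name"]
print(df_users)

# Hypothetical field name: adjust "user_followers_count" to your data
pipeline_min_followers = [
    {"$match": {"user_followers_count": {"$gt": 100}}},
    {"$group": {"_id": "$user_screen_name", "count": {"$sum": 1}}},
    {"$match": {"count": {"$gt": 100}}},
]
```

Placing the follower filter before `$group` means that only tweets from accounts above the threshold are counted.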

## Generate time-series plots of tweet volume

- Get documents with the "#England" hashtag and create a histogram of tweet counts in 5-minute bins

In [None]:
from datetime import datetime

In [None]:
# Tweets mentioning "England" (case-insensitive); fetch only the timestamp field
query = {"text": {"$regex": "England", "$options": "i"}}
ts = [item['created_at'] for item in collection.find(query, {"_id": 0, "created_at": 1})]
# Parse Twitter's timestamp format into datetime objects
dt = [datetime.strptime(t, '%a %b %d %H:%M:%S +0000 %Y') for t in ts]
df_time = pd.DataFrame({"ts": ts, "dt": dt})
# Count tweets per 5-minute bin, then plot the series
df_eng = df_time.groupby(pd.Grouper(key='dt', freq='5Min')).count().rename({"ts": "England"}, axis='columns')
df_eng.plot(kind="line", figsize=(10, 4))
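To see what the 5-minute grouping does, the same binning can be written by hand: parse each timestamp, round it down to the start of its 5-minute window, and count. This is only an illustration with made-up timestamps and a hypothetical helper name; unlike the `pd.Grouper` version, it does not emit empty bins with a count of zero.

```python
from collections import Counter
from datetime import datetime

TWITTER_FORMAT = "%a %b %d %H:%M:%S +0000 %Y"


def floor_to_5min(dt):
    """Round a datetime down to the start of its 5-minute window."""
    return dt.replace(minute=dt.minute - dt.minute % 5, second=0, microsecond=0)


# Made-up timestamps in the same format as "created_at"
stamps = [
    "Tue Nov 29 19:03:12 +0000 2022",
    "Tue Nov 29 19:04:59 +0000 2022",
    "Tue Nov 29 19:05:00 +0000 2022",
]
bins = Counter(floor_to_5min(datetime.strptime(s, TWITTER_FORMAT)) for s in stamps)
for start, n in sorted(bins.items()):
    print(start, n)
```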

## Let's try to create a plot with two series


In [None]:
ts = [item['created_at'] for item in list(collection.find({"text":{"$regex":"wales", "$options":"i"}}, {"_id": 0, "created_at": 1}))]
dt = [datetime.strptime(t,'%a %b %d %H:%M:%S +0000 %Y') for t in ts]
df_time = pd.DataFrame({"ts":ts, "dt":dt})
df_wales = df_time.groupby(pd.Grouper(key='dt', freq='5Min')).count().rename({"ts": "Wales"}, axis = 'columns')

In [None]:
df_tmp = df_wales.join(df_eng).reset_index()

In [None]:
df_tmp_long = df_tmp.melt('dt', var_name='country', value_name='count')
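`DataFrame.join` performs a left join by default, so 5-minute bins that exist only in the second frame are dropped and bins missing from it become NaN. If both series should cover the full time range, an outer join followed by `fillna(0)` is one option. The sketch below uses two tiny made-up frames to show the join and the wide-to-long reshape.

```python
import pandas as pd

# Two made-up count series whose time ranges only partly overlap
idx_wales = pd.to_datetime(["2022-11-29 19:00", "2022-11-29 19:05"])
idx_eng = pd.to_datetime(["2022-11-29 19:05", "2022-11-29 19:10"])
wales = pd.DataFrame({"Wales": [3, 5]}, index=pd.Index(idx_wales, name="dt"))
england = pd.DataFrame({"England": [4, 6]}, index=pd.Index(idx_eng, name="dt"))

# Outer join keeps every bin from both frames; missing counts become 0
wide = wales.join(england, how="outer").fillna(0).reset_index()

# One row per (time bin, country) pair, ready for seaborn's hue argument
long = wide.melt("dt", var_name="country", value_name="count")
print(long)
```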


- Try seaborn

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

sns.lineplot(x = "dt", y = "count", hue = 'country', data = df_tmp_long)

- Adjust the x-axis labels

In [None]:
import matplotlib.dates as mdates
plt.figure(figsize = (10, 5))
ax = sns.lineplot(x = "dt", y = "count", hue = 'country', data = df_tmp_long)
ax.xaxis.set_major_locator(mdates.HourLocator(interval=1))
ax.xaxis.set_major_formatter(mdates.DateFormatter('%H:%M'))