# Homework 2

## 1. Create Database and fill it

I will use the Twitter sentiment dataset mentioned in the task. The original data is a single table with three columns:
*   ItemID -- an index column
*   Sentiment -- a boolean column indicating whether the sentiment of the tweet is positive or negative
*   SentimentText -- a text column containing the text of the tweet



However, tweet texts often contain mentions of other users or hashtags. I will extract these mentions and hashtags and use them to build the following database with 5 tables:
*   `tweets` table: the original data, with the columns renamed to `tweet_id`, `sentiment` and `text`
*   `users` table, with 2 columns: `user_id`, `username` (only users mentioned at least once appear here)
*   `hashtags` table, with 2 columns: `hashtag_id`, `hashtag`
*   `mentions` table, with 3 columns: `mention_id`, `tweet_id` and `user_id` (an entry here means that the tweet with `tweet_id` mentions the user with `user_id`)
*   `hashtag_usage` table, with 3 columns: `hashtag_usage_id`, `tweet_id` and `hashtag_id` (an entry here means that the tweet with `tweet_id` contains the hashtag with `hashtag_id`)
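
To make the role of the link tables concrete, here is a minimal sketch that joins `mentions` back to `tweets` and `users`. It uses an in-memory database, a simplified version of the schema described above, and made-up rows (the usernames `alice` and `bob` and the tweet text are assumptions for illustration, not values from the dataset).

```python
import sqlite3

# In-memory toy database; all rows below are made up for illustration only.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE tweets(tweet_id INTEGER PRIMARY KEY, sentiment BOOL, text TEXT)")
cur.execute("CREATE TABLE users(user_id INTEGER PRIMARY KEY, username TEXT)")
cur.execute("CREATE TABLE mentions(mention_id INTEGER PRIMARY KEY, tweet_id INT, user_id INT)")

cur.execute("INSERT INTO tweets VALUES (1, 1, 'thanks @alice and @bob')")
cur.executemany("INSERT INTO users VALUES (?, ?)", [(1, "alice"), (2, "bob")])
cur.executemany("INSERT INTO mentions (tweet_id, user_id) VALUES (?, ?)", [(1, 1), (1, 2)])

# Resolve every mention to the id of the tweet and the name of the mentioned user.
rows = cur.execute(
    """
    SELECT t.tweet_id, u.username
    FROM mentions AS m
    JOIN tweets AS t ON t.tweet_id = m.tweet_id
    JOIN users AS u ON u.user_id = m.user_id
    ORDER BY m.mention_id
    """
).fetchall()
print(rows)  # [(1, 'alice'), (1, 'bob')]
conn.close()
```

The same pattern would apply to `hashtag_usage` and `hashtags`, joining on `hashtag_id` instead of `user_id`.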




Import `sqlite3` and create the tables:

In [1]:
import sqlite3

# connecting to the database
conn = sqlite3.connect('twitter.db')

# creating a cursor object we will be sending our queries to
c = conn.cursor()

In [2]:
# drop any tables left over from a previous run
c.execute("DROP TABLE IF EXISTS tweets")
c.execute("DROP TABLE IF EXISTS users")
c.execute("DROP TABLE IF EXISTS hashtags")
c.execute("DROP TABLE IF EXISTS mentions")
c.execute("DROP TABLE IF EXISTS hashtag_usage")

# create tables
c.execute("CREATE TABLE tweets(tweet_id INTEGER PRIMARY KEY AUTOINCREMENT, sentiment BOOL, text TEXT)")
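# TEXT rather than STRING: SQLite gives a column declared STRING numeric affinity,
# so a purely numeric username or hashtag would be stored as a number
c.execute("CREATE TABLE users(user_id INTEGER PRIMARY KEY AUTOINCREMENT, username TEXT)")
c.execute("CREATE TABLE hashtags(hashtag_id INTEGER PRIMARY KEY AUTOINCREMENT, hashtag TEXT)")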
c.execute("CREATE TABLE mentions(mention_id INTEGER PRIMARY KEY AUTOINCREMENT, tweet_id INT, user_id INT)")
c.execute("CREATE TABLE hashtag_usage(hashtag_usage_id INTEGER PRIMARY KEY AUTOINCREMENT, tweet_id INT, hashtag_id INT)")

<sqlite3.Cursor at 0x7f9ad38066c0>

Now load the dataset:

In [3]:
import pandas as pd
import requests

In [4]:
dataset_url = "https://raw.githubusercontent.com/vineetdhanawat/twitter-sentiment-analysis/master/datasets/Sentiment%20Analysis%20Dataset%20100000.csv"

twitter_dataset = pd.read_csv(dataset_url, sep=",", encoding="ISO-8859-1")

In [5]:
twitter_dataset.columns = ['tweet_id', 'sentiment', 'text']
twitter_dataset.head(15)

Unnamed: 0,tweet_id,sentiment,text
0,1,0,is so sad for my APL frie...
1,2,0,I missed the New Moon trail...
2,3,1,omg its already 7:30 :O
3,4,0,.. Omgaga. Im sooo im gunna CRy. I'...
4,5,0,i think mi bf is cheating on me!!! ...
5,6,0,or i just worry too much?
6,7,1,Juuuuuuuuuuuuuuuuussssst Chillin!!
7,8,0,Sunny Again Work Tomorrow :-| ...
8,9,1,handed in my uniform today . i miss you ...
9,10,1,hmmmm.... i wonder how she my number @-)


Fill in the tables:



In [6]:
import re

# iterate over rows
for _, tweet in twitter_dataset.iterrows():
  # insert in the main table
  c.execute("INSERT INTO tweets VALUES (?, ?, ?)", (tweet['tweet_id'], tweet['sentiment'], tweet['text']))

  # find all mentions using a regular expression (raw string, so \w is not read as a string escape)
  mentions = re.findall(r"(^|[^@\w])@(\w{1,15})", tweet['text'])
  for _, user in mentions:
    c.execute("SELECT user_id FROM users WHERE username = ?", (user,))
    res = c.fetchall()
    if res == []:
      c.execute("INSERT INTO users (username) VALUES (?)", (user,))
      c.execute("SELECT user_id FROM users WHERE username = ?", (user,))
      user_id = c.fetchall()[0][0]
    else:
      user_id = res[0][0]
    c.execute("INSERT INTO mentions (tweet_id, user_id) VALUES (?, ?)", (tweet['tweet_id'], user_id))

  # find hashtags using regular expressions
  hashtags = list(set(re.findall(r"#(\w+)", tweet['text'])))
  for hashtag in hashtags:
    c.execute("SELECT hashtag_id FROM hashtags WHERE hashtag = ?", (hashtag,))
    res = c.fetchall()
    if res == []:
      c.execute("INSERT INTO hashtags (hashtag) VALUES (?)", (hashtag,))
      c.execute("SELECT hashtag_id FROM hashtags WHERE hashtag = ?", (hashtag,))
      hashtag_id = c.fetchall()[0][0]
    else:
      hashtag_id = res[0][0]
    c.execute("INSERT INTO hashtag_usage (tweet_id, hashtag_id) VALUES (?, ?)", (tweet['tweet_id'], hashtag_id))
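
The two regular expressions in the loop above can be tried on a single string first. The sketch below uses a made-up sample text (an assumption, not a row from the dataset) and shows that an `@` inside an e-mail address is not treated as a mention and that repeated hashtags are collapsed.

```python
import re

# Made-up sample text, not taken from the dataset.
sample = "@alice thanks! cc @bob, mail me at me@example.com #fun #fun #python"

# The first group captures the character before '@' (or the start of the string),
# so an '@' that directly follows a word character or another '@' is skipped.
mention_pairs = re.findall(r"(^|[^@\w])@(\w{1,15})", sample)
usernames = [name for _, name in mention_pairs]

# set() removes hashtags repeated within one text; sorted() makes the order stable.
hashtags = sorted(set(re.findall(r"#(\w+)", sample)))

print(usernames)  # ['alice', 'bob']
print(hashtags)   # ['fun', 'python']
```

Note that only hashtags are deduplicated per tweet in the loop; a user mentioned twice in one tweet produces two rows in `mentions`.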

In [7]:
# committing the changes
conn.commit()

Load the data back from the database into pandas DataFrames:

In [8]:
# read_sql_query already returns a DataFrame with the table's column names
tweets_df = pd.read_sql_query("SELECT * FROM tweets", conn)
users_df = pd.read_sql_query("SELECT * FROM users", conn)
hashtags_df = pd.read_sql_query("SELECT * FROM hashtags", conn)
mentions_df = pd.read_sql_query("SELECT * FROM mentions", conn)
hashtag_usage_df = pd.read_sql_query("SELECT * FROM hashtag_usage", conn)

In [16]:
# closing the database
conn.close()

In [15]:
mentions_df[:25]

Unnamed: 0,mention_id,tweet_id,user_id
0,1,20,1
1,2,46,2
2,3,47,3
3,4,47,4
4,5,49,5
5,6,81,6
6,7,82,7
7,8,82,8
8,9,111,9
9,10,125,10


### 0.1. Some preprocessing

In this handout I will use data from the DraCor Shakespeare corpus ([https://dracor.org/shake](https://dracor.org/shake)). It contains ?????.
Let's load it via the API.

In [3]:
corpus_url = "https://dracor.org/api/corpora/shake"

metadata_info = requests.get(corpus_url + "/metadata", headers={"accept":"text/csv"}, stream=True)
metadata_info.raw.decode_content = True

In [4]:
metadata_df = pd.read_csv(metadata_info.raw, sep=",", encoding="utf-8")

In [5]:
metadata_df.head()

Unnamed: 0,name,id,firstAuthor,numOfCoAuthors,title,subtitle,normalizedGenre,digitalSource,originalSourcePublisher,originalSourcePubPlace,...,numEdges,yearWritten,numOfSegments,wikipediaLinkCount,numOfActs,wordCountText,wordCountSp,wordCountStage,numOfP,numOfL
0,a-midsummer-night-s-dream,shake000008,Shakespeare,0,A Midsummer Night’s Dream,,,,,,...,205,1595,9,65,5,17772,17127,789,179,1749
1,all-s-well-that-ends-well,shake000012,Shakespeare,0,All’s Well That Ends Well,,,,,,...,127,1605,24,32,5,25066,24421,826,524,1627
2,antony-and-cleopatra,shake000035,Shakespeare,0,Antony and Cleopatra,,,,,,...,398,1606,42,39,5,27119,25878,1407,149,3236
3,as-you-like-it,shake000010,Shakespeare,0,As You Like It,,,,,,...,110,1599,23,48,5,23721,23113,970,597,1205
4,coriolanus,shake000026,Shakespeare,0,Coriolanus,,,,,,...,361,1608,29,36,5,29948,28851,1460,302,2984


In [6]:
list(metadata_df.columns)

['name',
 'id',
 'firstAuthor',
 'numOfCoAuthors',
 'title',
 'subtitle',
 'normalizedGenre',
 'digitalSource',
 'originalSourcePublisher',
 'originalSourcePubPlace',
 'originalSourceYear',
 'originalSourceNumberOfPages',
 'yearNormalized',
 'size',
 'libretto',
 'averageClustering',
 'density',
 'averagePathLength',
 'maxDegreeIds',
 'averageDegree',
 'diameter',
 'yearPremiered',
 'yearPrinted',
 'maxDegree',
 'numOfSpeakers',
 'numOfSpeakersFemale',
 'numOfSpeakersMale',
 'numOfSpeakersUnknown',
 'numPersonGroups',
 'numConnectedComponents',
 'numEdges',
 'yearWritten',
 'numOfSegments',
 'wikipediaLinkCount',
 'numOfActs',
 'wordCountText',
 'wordCountSp',
 'wordCountStage',
 'numOfP',
 'numOfL']

Drop the unnecessary columns (those filled entirely with NaNs or otherwise not useful):

In [7]:
metadata_df[metadata_df.columns.difference(['id',
                                            'firstAuthor',
                                            'numOfCoAuthors',
                                            'subtitle',
                                            'normalizedGenre',
                                            'digitalSource',
                                            'originalSourcePublisher',
                                            'originalSourcePubPlace',
                                            'originalSourceYear',
                                            'originalSourceNumberOfPages',
                                            'libretto',
                                            'yearPremiered',
                                            'numPersonGroups'])].head()

Unnamed: 0,averageClustering,averageDegree,averagePathLength,density,diameter,maxDegree,maxDegreeIds,name,numConnectedComponents,numEdges,...,numOfSpeakersUnknown,size,title,wikipediaLinkCount,wordCountSp,wordCountStage,wordCountText,yearNormalized,yearPrinted,yearWritten
0,0.86987,16.4,1.316667,0.683333,2,24,RobinGoodfellow_MND|Titania_MND,a-midsummer-night-s-dream,1,205,...,3,25,A Midsummer Night’s Dream,65,17127,789,17772,1595,,1595
1,0.792775,8.758621,1.727513,0.312808,3,22,Parolles_AWW,all-s-well-that-ends-well,2,127,...,10,29,All’s Well That Ends Well,32,24421,826,25066,1605,1623.0,1605
2,0.815458,10.756757,2.080954,0.147353,4,47,Antony_JC,antony-and-cleopatra,2,398,...,22,74,Antony and Cleopatra,39,25878,1407,27119,1606,1623.0,1606
3,0.779298,7.857143,1.907407,0.291005,4,18,Orlando_AYL|Touchstone_AYL,as-you-like-it,1,110,...,3,28,As You Like It,48,23113,970,23721,1599,1623.0,1599
4,0.79671,10.776119,1.977415,0.163275,4,54,Coriolanus_Cor,coriolanus,2,361,...,11,67,Coriolanus,36,28851,1460,29948,1608,1623.0,1608
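
As a possible alternative to listing every column by hand, the sketch below drops all-NaN columns automatically and then removes the remaining unwanted ones by name. The frame `toy_df` and its values are made up for illustration; only the column names are borrowed from the metadata table.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values; only the column names echo the metadata table.
toy_df = pd.DataFrame({
    "name": ["play-a", "play-b"],
    "subtitle": [np.nan, np.nan],   # entirely empty column
    "numOfCoAuthors": [0, 0],       # filled, but treated as not useful here
    "numEdges": [205, 127],
})

# Drop columns in which every value is NaN.
no_empty = toy_df.dropna(axis=1, how="all")

# Drop further columns by name; errors="ignore" skips names that are already gone.
cleaned = no_empty.drop(columns=["numOfCoAuthors", "subtitle"], errors="ignore")

print(list(cleaned.columns))  # ['name', 'numEdges']
```

Unlike `columns.difference`, which returns the remaining labels in sorted order, `drop` keeps the original column order.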
