# <span style="font-size: 1em">From raw data to temporal graph structure exploration</span><span style="font-size: 0.8em"> Second Assignment</span>
<h3>Social Network Analysis 2022-2023</h3>
<h5>M.Sc. In Business Analytics (Part Time) 2022-2024 at Athens University of Economics and Business (A.U.E.B.)</h5>

---

> Student: Panagiotis G. Vaidomarkakis<br />
> Student I.D.: p2822203<br />
> Instructor: Dr. Katia Papakonstantinopoulou<br />
> Due Date: 08/07/2023

In [1]:
# Importing the module
import os
print("The Current working directory is: {0}".format(os.getcwd()))
# Changing the current working directory
# os.chdir("path/to/your/working/directory")  # <-- uncomment and set your working directory here
print("The Current working directory now is: {0}".format(os.getcwd()))

The Current working directory is: c:\Users\pvaidoma\Downloads\Social Network Graphs
The Current working directory now is: c:\Users\pvaidoma\Downloads\Social Network Graphs


(Run the following cell only if you haven't installed the tqdm package.)

In [2]:
# %pip install tqdm

The following code performs all the data preparation needed to extract the data for the specific days that we need.<br>
I followed the assignment instructions to extract the right information.<br>
I decided to exclude hashtags made up of fewer than 4 digits and no letters, because they do not seem to carry any meaning. For example, #1 can mean anything and is very common. Hashtags that start with 4 or more digits are accepted, because they may be dates (for example #1995).<br>
I kept every hashtag that contains at least one letter, even when it is mixed with numbers; for example, #1music is an acceptable hashtag.<br>
The rest of the extraction follows the instructions.
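
As a quick illustration of the hashtag rule described above, here is a minimal sketch that applies the same two regular expressions used in the preparation code to a made-up tweet. The helper name `filter_hashtags` and the sample text are assumptions for demonstration only.

```python
import re

def filter_hashtags(tweet):
    """Keep hashtags that contain an ASCII letter or start with 4+ digits."""
    tags = re.findall(r'#(\w+)', tweet)
    return [f"#{tag}" for tag in tags if re.search(r'^\d{4,}|[a-zA-Z]', tag)]

# Hypothetical tweet text, not taken from the dataset
sample = "Top #1 hit of #1995 on #1music, see #123 and #music"
kept = filter_hashtags(sample)
print(kept)  # ['#1995', '#1music', '#music']
```

Here #1 and #123 are dropped because they are short and purely numeric, while #1995, #1music and #music are kept.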

In [3]:
import datetime
import re
import time
import random
import gzip
from tqdm import tqdm

# Operations timer
start_time = time.time()

def ParseTweet(tweet):
    mentions = re.findall(r'@(\w+)', tweet)
    hashtags = re.findall(r'#(\w+)', tweet)
    filtered_hashtags = [f"#{tag}" for tag in hashtags if re.search(r'^\d{4,}|[a-zA-Z]', tag)]
    return mentions, filtered_hashtags

def ReadTweet(f, user_hashtags):
    f.readline()
    TimeLine = f.readline()
    UserLine = f.readline()
    TweetLine = f.readline()

    Timestamp = datetime.datetime.strptime(TimeLine.split('\t')[1].strip(), '%Y-%m-%d %H:%M:%S')
    Username = UserLine.split('\t')[1].strip().split('/')[-1]
    Tweet = TweetLine.split('\t')[1].strip()

    Mentions, Hashtags = ParseTweet(Tweet)

    return (Timestamp, Username, Mentions, Hashtags)

AggregateData = dict()
UserHashtags = dict()  # Dictionary to store hashtag counts for each user for each day
raw_data_file = "tweets2009-07.txt.gz" # <-- this can be found at https://drive.google.com/file/d/1RjWUg-6KrVOjJPZHHQg-h_9gSSWZUPn-/view
with gzip.open(raw_data_file, 'rt', encoding='utf-8') as f:
    i = 0
    pbar = tqdm(total=46200000)  # Set the total number of records to process

    while True:
        i += 1
        if i % 100000 == 0:
            pbar.update(100000)  # Update the progress bar

        try:
            Timestamp, Username, Mentions, Hashtags = ReadTweet(f, UserHashtags)
        except (IndexError, ValueError):  # End of file: readline() returns '' so the split/parse fails
            break

        if Timestamp < datetime.datetime(2009, 7, 1, 0, 0, 0) or Timestamp > datetime.datetime(2009, 7, 5, 23, 59, 59):
            continue

        # Create dictionaries for each day and each user if they don't exist
        if Timestamp.date() not in AggregateData.keys():
            AggregateData[Timestamp.date()] = dict()
        if Timestamp.date() not in UserHashtags.keys():
            UserHashtags[Timestamp.date()] = dict()

        for mention in Mentions:
            AggregateData[Timestamp.date()][(Username, mention)] = AggregateData[Timestamp.date()].get((Username, mention), 0) + 1

        # Increment hashtag counts for the user
        if Username not in UserHashtags[Timestamp.date()]:
            UserHashtags[Timestamp.date()][Username] = dict()
        for hashtag in Hashtags:
            UserHashtags[Timestamp.date()][Username][hashtag] = UserHashtags[Timestamp.date()][Username].get(hashtag, 0) + 1

    pbar.close()  # Close the progress bar

# Determine the most important topic for each user for each day
UserTopics = dict()
for date, hashtags in UserHashtags.items():
    UserTopics[date] = dict()
    for user, user_hashtags in hashtags.items():
        if user_hashtags:
            max_count = max(user_hashtags.values())
            max_topics = [topic for topic, count in user_hashtags.items() if count == max_count]
            random_topic = random.choice(max_topics)
            UserTopics[date][user] = random_topic
        else:
            UserTopics[date][user] = 'null/na'

# Write user topics to separate CSV files for each day and for each user
for date, topics in UserTopics.items():
    filename = date.strftime('topic_of_interest_%Y_%m_%d.csv')
    with open(filename, 'w', encoding='utf8') as f:
        f.write('user,topic_of_interest\n')
        for user, topic in topics.items():
            f.write('{},{}\n'.format(user, topic))

# Write mention data to CSV files for each day
for date, data in AggregateData.items():
    filename = date.strftime('edgelist_%Y_%m_%d.csv')
    with open(filename, 'w', encoding='utf8') as f:
        f.write('from,to,weight\n')
        for pair, weight in data.items():
            f.write('{},{},{}\n'.format(pair[0], pair[1], weight))

# End of timer
elapsed_time = time.time() - start_time
minutes, seconds = divmod(elapsed_time, 60)
print("All CSV files are ready!")
print("Elapsed Time: {:.0f} minutes and {:.2f} seconds".format(minutes, seconds))

100%|██████████| 46200000/46200000 [09:08<00:00, 84246.98it/s]


All CSV files are ready!
Elapsed Time: 9 minutes and 12.01 seconds
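
To make the parsing step easier to follow, the sketch below runs the same line-by-line logic as `ReadTweet` on a single in-memory record. The record layout (a separator line followed by tab-separated `T`, `U` and `W` lines) is inferred from the parsing code, and the sample values, the user `alice` and the helper name `read_record` are hypothetical.

```python
import datetime
import io
import re

# Hypothetical record in the layout the parser appears to expect:
# a separator line, then tab-separated T (time), U (user URL), W (text) lines.
sample = (
    "\n"
    "T\t2009-07-01 00:00:05\n"
    "U\thttp://twitter.com/alice\n"
    "W\t@bob loving the new #music release #1\n"
)

def read_record(f):
    f.readline()  # skip the separator line
    time_line = f.readline()
    user_line = f.readline()
    tweet_line = f.readline()
    timestamp = datetime.datetime.strptime(
        time_line.split('\t')[1].strip(), '%Y-%m-%d %H:%M:%S')
    # The username is the last path segment of the profile URL
    username = user_line.split('\t')[1].strip().split('/')[-1]
    tweet = tweet_line.split('\t')[1].strip()
    mentions = re.findall(r'@(\w+)', tweet)
    return timestamp, username, mentions

timestamp, username, mentions = read_record(io.StringIO(sample))
print(timestamp, username, mentions)
```

With this record, the mention would add one to the weight of the edge (`alice`, `bob`) for 2009-07-01. Once the stream is exhausted, `readline()` returns an empty string and the `split('\t')[1]` lookup raises `IndexError`, which is how the main loop detects the end of the file.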
