# Info

This notebook packages the cleaning steps from CleanElonTweetDataNotebook.ipynb into a few functions that you can call directly.

The read_and_clean_2020 function returns a cleaned dataframe from the file 2020.csv.

The read_and_clean_2021_2022 function returns a cleaned dataframe from either 2021.csv or 2022.csv, depending on the year you pass in.

The last function concatenates the dataframes into one dataframe and uploads it to our feature store on Hopsworks.



In [None]:
!pip install hopsworks

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [None]:
import pandas as pd
import numpy as np
from datetime import datetime, timedelta

def read_and_clean_2020():
  df_20 = pd.read_csv("2020.csv")
  
  #Drop unnecessary columns
  drop_cols = ['Unnamed: 0', 'id', 'conversation_id', 'place', 'hashtags', 'cashtags', 'user_id','user_id_str', 
            'link', 'quote_url', 'search', 'geo', 'near', 'source', 'translate', 'trans_src', 'trans_dest', 'retweet_date','thumbnail', 'created_at', 'user_rt_id', 'user_rt',
            'retweet_id', 'reply_to', 'hour', 'username', 'name', 'language', 'nretweets', 'nreplies', 'urls', 'photos', 'video', 'retweet']
  df_20 = df_20.drop(columns = drop_cols)
  #For convenience, we rename date column to datetime since that is what it is
  df_20 = df_20.rename(columns={'date':'datetime'})
  #Shift the time by one hour so that it follows Swedish time
  df_20['datetime'] = pd.to_datetime(df_20['datetime'])
  df_20['datetime'] = df_20['datetime'] + timedelta(hours=1)
  df_20['date'] = df_20['datetime'].dt.strftime("%Y-%m-%d")
  df_20['time'] = df_20['datetime'].dt.strftime("%H:%M:%S")
  #Datetime and timezone columns are not needed now so we drop them
  df_20 = df_20.drop(columns=['datetime', 'timezone'])
  
  return df_20
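
If the raw timestamps are in UTC, a fixed one-hour shift matches Swedish time only during winter. As a minimal sketch, assuming the raw timestamps really are UTC (the source file does not confirm this), pandas can apply the daylight-saving-aware offset instead. The sample rows below are made up for illustration and are not taken from 2020.csv.

```python
import pandas as pd

# Hypothetical sample rows; ASSUMPTION: the raw timestamps are in UTC.
sample = pd.DataFrame({"date": ["2020-01-15 12:00:00", "2020-07-15 12:00:00"]})

# Mark the naive timestamps as UTC, then convert to Swedish local time.
local = (
    pd.to_datetime(sample["date"])
    .dt.tz_localize("UTC")
    .dt.tz_convert("Europe/Stockholm")
)

# Split into the same 'date' and 'time' string columns used above.
sample["date"] = local.dt.strftime("%Y-%m-%d")
sample["time"] = local.dt.strftime("%H:%M:%S")
print(sample)
```

With this approach the January row is shifted by one hour and the July row by two, which a constant `timedelta(hours=1)` cannot express.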






In [None]:
import pandas as pd
import numpy as np
from datetime import datetime, timedelta

def read_and_clean_2021_2022(year):
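  if year == 2021:
    df = pd.read_csv("2021.csv")
  elif year == 2022:
    df = pd.read_csv("2022.csv")
  else:
    #Fail early on unsupported years; otherwise df would be undefined below
    raise ValueError("year must be 2021 or 2022")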

  drop_cols = ['id', 'conversation_id', 'place', 'hashtags', 'cashtags', 'user_id', 
             'link', 'quote_url', 'geo', 'near', 'source', 'translate', 'trans_src', 'trans_dest', 'retweet_date','thumbnail', 'created_at', 'user_rt_id', 'user_rt',
             'retweet_id', 'reply_to', 'retweet', 'username', 'name', 'language', 'mentions', 'urls', 'photos', 'video', 'retweets_count', 'replies_count']
  df = df.drop(columns=drop_cols)

  df = df.rename(columns={'time':'timestamp'})
  df['timestamp'] = pd.to_datetime(df['timestamp'], format='%H:%M:%S')
  # Extract the time from each timestamp
  df['time'] = df['timestamp'].dt.time
  df['time'] = df['time'].astype(str)
  df = df.drop(columns=['timestamp'])

  # Convert the 'date' and 'time' columns to datetime objects
  df['datetime'] = pd.to_datetime(df['date'] + ' ' + df['time'], format='%Y-%m-%d %H:%M:%S')
  # Subtract 3 hours from each timestamp
  df['datetime'] = df['datetime'] - timedelta(hours=3)
  # Format the shifted 'datetime' column as a string
  df['datetime'] = df['datetime'].dt.strftime('%Y-%m-%d %H:%M:%S')


  #Drop old times 
  df = df.drop(columns=['time', 'date'])


  #Take the new datetime and create new time and date columns with updated time
  df['datetime'] = pd.to_datetime(df['datetime'])
  df['date'] = df['datetime'].dt.strftime("%Y-%m-%d")
  df['time'] = df['datetime'].dt.strftime("%H:%M:%S")


  #Then we can finally drop the datetime column and timezone column as they are not needed anymore
  df = df.drop(columns=['datetime', 'timezone'])
  df['day']  = pd.to_datetime(df['date'], format='%Y-%m-%d').dt.weekday

  df = df.rename(columns={'likes_count':'nlikes'})

  return df
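
To make the date and time handling easier to follow, here is a small sketch of the same idea on an in-memory dataframe: combine `date` and `time`, subtract three hours, then split the result again and derive the weekday. The sample row is hypothetical, and the sketch skips the intermediate string conversions that the function above performs.

```python
import pandas as pd
from datetime import timedelta

# Hypothetical sample row, not taken from 2021.csv or 2022.csv.
sample = pd.DataFrame(
    {"date": ["2021-03-01"], "time": ["01:30:00"], "likes_count": [10]}
)

# Combine date and time into one datetime and shift it back three hours.
shifted = pd.to_datetime(
    sample["date"] + " " + sample["time"], format="%Y-%m-%d %H:%M:%S"
) - timedelta(hours=3)

# Rebuild the string columns from the shifted datetime.
sample["date"] = shifted.dt.strftime("%Y-%m-%d")
sample["time"] = shifted.dt.strftime("%H:%M:%S")
# Monday is 0 and Sunday is 6.
sample["day"] = shifted.dt.weekday

sample = sample.rename(columns={"likes_count": "nlikes"})
print(sample)
```

Note that the shift crosses midnight here, so the date moves back a day and the weekday changes with it.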


# Uploading to Hopsworks

In this step, we use the previous two functions to create three dataframes, concatenate them, and upload the result to Hopsworks. The data can later be accessed from that feature store.
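
Because `pd.concat` fills any column that is missing from one of the frames with NaN, it can be worth checking that the frames share the same columns before concatenating. The sketch below illustrates this with a hypothetical helper, `column_mismatches`, and made-up frames; neither is part of the original notebook. It also passes `ignore_index=True`, which the notebook's own call does not use.

```python
import pandas as pd

def column_mismatches(frames):
    """Hypothetical helper: for each frame, list the columns it lacks
    relative to the union of all columns."""
    all_cols = set().union(*(f.columns for f in frames))
    return {i: sorted(all_cols - set(f.columns)) for i, f in enumerate(frames)}

# Made-up frames standing in for the cleaned dataframes.
a = pd.DataFrame({"tweet": ["x"], "nlikes": [1]})
b = pd.DataFrame({"tweet": ["y"], "nlikes": [2], "day": [3]})

missing = column_mismatches([a, b])
print(missing)

# Concatenating anyway leaves NaN where a frame had no such column.
combined = pd.concat([a, b], ignore_index=True)
print(combined)
```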


In [None]:



import os


def g():
    import hopsworks
    import pandas as pd

    #Use our functions that read and clean the csv files
    df_20 = read_and_clean_2020()
    df_21 = read_and_clean_2021_2022(2021)
    df_22 = read_and_clean_2021_2022(2022)
    frames = [df_20, df_21, df_22]
    #Concatenate the dfs into one df
    concat_df = pd.concat(frames)

    project = hopsworks.login()
    fs = project.get_feature_store()

    #Use every column as part of the primary key
    feature_list = list(concat_df.columns)

    em_tweet_fg = fs.get_or_create_feature_group(
        name="elon_musk_tweets_modal",
        version=1,
        primary_key= feature_list,
        description = "Elon Musks Tweets from 2010-2022")
    em_tweet_fg.insert(concat_df, write_options={"wait_for_job" : False})


if __name__ == "__main__":
    g()

Copy your Api Key (first register/login): https://c.app.hopsworks.ai/account/api/generated

Paste it here: ··········
Connected. Call `.close()` to terminate connection gracefully.

Logged in to project, explore it here https://c.app.hopsworks.ai:443/p/4251




Connected. Call `.close()` to terminate connection gracefully.
Feature Group created successfully, explore it at 
https://c.app.hopsworks.ai:443/p/4251/fs/4194/fg/15720


Uploading Dataframe: 0.00% |          | Rows 0/15860 | Elapsed Time: 00:00 | Remaining Time: ?

Launching offline feature group backfill job...
Backfill Job started successfully, you can follow the progress at 
https://c.app.hopsworks.ai/p/4251/jobs/named/elon_musk_tweets_modal_1_offline_fg_backfill/executions
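
Once the backfill job has finished, the data can be read back from the feature store. The following is a rough sketch rather than part of the original notebook: the function name `read_elon_tweets` is hypothetical, and it assumes the feature group name and version used above and a valid Hopsworks API key.

```python
def read_elon_tweets(name="elon_musk_tweets_modal", version=1):
    """Hypothetical helper: read the feature group back into a pandas DataFrame."""
    # Imported lazily, as in g() above, so defining the function needs no login.
    import hopsworks

    project = hopsworks.login()
    fs = project.get_feature_store()
    fg = fs.get_feature_group(name=name, version=version)
    return fg.read()
```

Calling `read_elon_tweets()` prompts for the API key in the same way as the upload step.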
