### Importing Pandas

First, import the pandas library:

In [2]:
import pandas as pd

### Loading Data

Data can be loaded into a pandas DataFrame from CSV or Excel files. Here is an example of loading data from a CSV file:

In [3]:
# Load data from a CSV file
df = pd.read_csv('./datasets/dataset_racism/twitter_scrapping.csv')
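A side note on column types: in the `df.head()` output further down, `ID del Usuario` is displayed as a float (`1.429793e+18`). Numbers that large cannot all be represented exactly as 64-bit floats, so distinct IDs may collapse into the same value. One way this can happen is when a column of large integers contains missing values. The sketch below reproduces the effect on a small in-memory CSV and shows one possible workaround, reading the column as text. The sample rows are invented for illustration, and whether the notebook needs this depends on how the IDs are used downstream.

```python
import io

import pandas as pd

# Hypothetical three-row sample; the middle row has no user ID.
sample_csv = (
    "ID del Usuario,Tweet\n"
    "1429793000000000001,first\n"
    ",second\n"
    "1429793000000000003,third\n"
)

# Default parsing: the missing value forces the column to float64.
as_float = pd.read_csv(io.StringIO(sample_csv))["ID del Usuario"]

# Reading the column as text keeps every digit of each ID.
as_text = pd.read_csv(
    io.StringIO(sample_csv), dtype={"ID del Usuario": str}
)["ID del Usuario"]

# Compare how many distinct IDs survive in each version.
print(as_float.dtype, as_float.nunique())
print(as_text.nunique())
```

If the IDs are only used as grouping keys, text is usually sufficient; converting to a nullable integer type is another option.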

### Get the Structure of the Dataset
Once the data is loaded, its structure can be inspected by listing the column names and counting the rows and columns:

In [4]:
# Get the column names as a list
column_names = df.columns.tolist()

# Print the column names
print()
print("------------- COLUMN NAMES -------------")
print()
for c in column_names:
    print(c)

# Get the number of rows and columns
num_rows, num_cols = df.shape
print()
print("--------------- STRUCTURE ---------------")
print()
print("Number of rows:", num_rows)
print("Number of columns:", num_cols)


------------- COLUMN NAMES -------------

Unnamed: 0.1
Unnamed: 0
Nombre de Usuario
Nombre Visible
ID del Usuario
Descripción del Usuario
Ubicación
Tweet
Fecha de Creación
Cantidad de likes
Idioma
En respuesta a

--------------- STRUCTURE ---------------

Number of rows: 4494508
Number of columns: 12
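
The two `Unnamed` columns in the list above look like row indexes that were written into the CSV by earlier `to_csv` calls without `index=False`. That is an assumption, since the source does not say where they come from. If they carry no information, they could be dropped as in this minimal sketch, which uses a made-up frame rather than the real dataset:

```python
import pandas as pd

# Hypothetical frame that mimics the leftover index columns.
demo = pd.DataFrame(
    {
        "Unnamed: 0.1": [0, 1],
        "Unnamed: 0": [0, 1],
        "Tweet": ["first tweet", "second tweet"],
    }
)

# Collect every column whose name starts with "Unnamed" and drop it.
unnamed = [c for c in demo.columns if c.startswith("Unnamed")]
cleaned = demo.drop(columns=unnamed)

print(cleaned.columns.tolist())
```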


### Exploring the Data

Once your data is loaded, you can explore it using various Pandas functions:

In [5]:
# Display the first few rows
print(df.head())

# Get information about the DataFrame
# (info() prints its report directly and returns None, so no print() is needed)
df.info()

# Check for missing values
print(df.isnull().sum())

# Statistical summary
print(df.describe())


   Unnamed: 0.1  Unnamed: 0         Nombre de Usuario Nombre Visible  \
0             0           0                    惹句👯‍♀️     LiI_Nigger   
1             1           1  Croispin O'Mhaghadia PhD       croispin   
2             2           2                    惹句👯‍♀️     LiI_Nigger   
3             3           3                    ＜お前凍結     OnanismAsp   
4             4           4                    2階3列4番  aya_is_kawaii   

   ID del Usuario                            Descripción del Usuario  \
0    1.429793e+18                      どひゃー選手権関東地区代表\r\n地獄(@jack_VTB   
1    1.429793e+18  Climate Father, Custodian Of Gaia, Doctor of R...   
2    1.429793e+18                      どひゃー選手権関東地区代表\r\n地獄(@jack_VTB   
3    1.429793e+18                                                NaN   
4    1.429793e+18                                           槌骨 砧骨 鐙骨   

                                           Ubicación  \
0                                            まるち！？！？   
1                     

### Filter the Data
Only the tweets in English are used for the project. The language code is stored in the `Idioma` column:

In [6]:
# Keep only the rows whose language code is 'en'
filtered_df = df[df['Idioma'] == 'en']

# Get the number of rows and columns
num_rows, num_cols = filtered_df.shape
print()
print("--------------- STRUCTURE ---------------")
print()
print("Number of rows:", num_rows)
print("Number of columns:", num_cols)


--------------- STRUCTURE ---------------

Number of rows: 3715313
Number of columns: 12
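
To see what the filter discards, it can help to count the rows per language code first. The following sketch runs on a small invented frame (the codes `ja` and `es` are placeholders, not values confirmed in the dataset). It also computes the share of rows kept in the real run from the two row counts printed above.

```python
import pandas as pd

# Hypothetical sample with three language codes.
toy = pd.DataFrame(
    {
        "Idioma": ["en", "ja", "en", "es"],
        "Tweet": ["hello", "konnichiwa", "bye", "hola"],
    }
)

# Rows per language, most frequent first.
counts = toy["Idioma"].value_counts()
print(counts)

# Same boolean-mask filter as in the notebook.
english = toy[toy["Idioma"] == "en"]
print(len(english))

# Share of rows kept in the real run: 3,715,313 of 4,494,508.
share_kept = 3715313 / 4494508
print(f"{share_kept:.1%}")
```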


### Create New Structures

For this project we need three output files:

- CSV1: Relations between users
- CSV2: Weight of each user, based on tweet likes
- JSON: Tweets grouped by user

In [10]:
# Save the English-only tweets
filtered_df.to_csv('./datasets/proccessed/racism/filtered_df.csv', index=False)

# Keep only the users whose tweets received more than 100 likes in total
likes_sum = filtered_df.groupby("ID del Usuario")["Cantidad de likes"].sum()
filtered_df = filtered_df[filtered_df["ID del Usuario"].isin(likes_sum[likes_sum > 100].index)]

# CSV1: one row per (Source, Target) pair. A missing Target is filled with
# the Source, so a tweet without a reply target becomes a self-loop.
csv_1 = (
    filtered_df[['Nombre Visible', 'En respuesta a']]
    .rename(columns={'Nombre Visible': 'Source', 'En respuesta a': 'Target'})
    .assign(Link=1, Target=lambda d: d['Target'].fillna(d['Source']))
    .dropna(subset=["Target"])
)
# Count the tweets per pair and keep the pairs with at least 100 links
csv_1 = csv_1.groupby(['Source', 'Target'])['Link'].sum().reset_index()
csv_1 = csv_1[csv_1['Link'] >= 100]
csv_1.to_csv('./datasets/proccessed/racism/graph.csv', index=False)

# CSV2: total likes per visible name
csv_2 = filtered_df.groupby('Nombre Visible')['Cantidad de likes'].sum().reset_index()
csv_2 = csv_2.rename(columns={'Nombre Visible': 'Visible Name', 'Cantidad de likes': 'Total Likes'})
csv_2.to_csv('./datasets/proccessed/racism/likes.csv', index=False)
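
The edge-list construction in the cell above is easier to follow on a tiny example. The sketch below applies the same rename, fill and group steps to four invented rows (the names `ana` and `bob` are hypothetical) and leaves out the 100-link threshold so that every pair stays visible.

```python
import pandas as pd

# Hypothetical tweets: ana replies to bob twice, posts once without a
# reply target, and bob replies to ana once.
toy = pd.DataFrame(
    {
        "Nombre Visible": ["ana", "ana", "ana", "bob"],
        "En respuesta a": ["bob", "bob", None, "ana"],
    }
)

edges = (
    toy.rename(columns={"Nombre Visible": "Source", "En respuesta a": "Target"})
    # Each tweet counts as one link; a missing Target becomes a self-loop.
    .assign(Link=1, Target=lambda d: d["Target"].fillna(d["Source"]))
    # Sum the links for every (Source, Target) pair.
    .groupby(["Source", "Target"], as_index=False)["Link"]
    .sum()
)

print(edges)
```

The result has one row per pair: `ana` to `ana` with weight 1, `ana` to `bob` with weight 2, and `bob` to `ana` with weight 1. Whether self-loops are wanted in the final graph is a design decision the notebook does not discuss.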


In [11]:
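import json

# Drop repeated tweets by the same user, then collect each user's tweets in a list
df_unique = filtered_df.drop_duplicates(subset=['Nombre Visible', 'Tweet'])
tweets = df_unique.groupby('Nombre Visible')['Tweet'].apply(list).to_dict()
# Note: "user_id" holds the visible name, not the numeric 'ID del Usuario'
data = [{"user_id": user, "msgs": msgs} for user, msgs in tweets.items()]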

json_data = json.dumps(data, indent=4, ensure_ascii=False, separators=(',', ': \n'))

# Save the JSON to a file
with open('./datasets/proccessed/racism/tweets_by_user.json', 'w', encoding='utf-8') as f:
    f.write(json_data)
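
As a quick check of the JSON layout, the sketch below builds the same `user_id`/`msgs` structure from a few invented rows, writes it to a temporary file and reads it back. The sample names and tweets are hypothetical, and the file is opened as UTF-8 explicitly because `ensure_ascii=False` writes non-ASCII characters verbatim.

```python
import json
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical rows; ana's first tweet appears twice.
toy = pd.DataFrame(
    {
        "Nombre Visible": ["ana", "ana", "ana", "bob"],
        "Tweet": ["hola", "hola", "adiós", "hi"],
    }
)

# Same steps as the notebook: deduplicate, then group tweets per user.
unique = toy.drop_duplicates(subset=["Nombre Visible", "Tweet"])
tweets = unique.groupby("Nombre Visible")["Tweet"].apply(list).to_dict()
data = [{"user_id": user, "msgs": msgs} for user, msgs in tweets.items()]

# Round-trip through a temporary file to confirm the structure survives.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "tweets_by_user.json"
    path.write_text(
        json.dumps(data, indent=4, ensure_ascii=False), encoding="utf-8"
    )
    restored = json.loads(path.read_text(encoding="utf-8"))

print(restored)
```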