# Analysis of alt-text use in mastodon.social by client

A project by [Cristal Rivera](https://linkedin.com/in/cristal-rivera-picado/) and [Tommaso Marmo](https://tommi.space/) to analyse the use of alt-text in [mastodon.social](https://mastodon.social/about), using [Stefan Bohacek](https://stefanbohacek.com/)’s <cite>[mastodon.social alt text use by client app](https://www.kaggle.com/datasets/fourtonfish/mastodon-social-alt-text-use-by-client-app)</cite> dataset, published on Kaggle under the [MIT license](https://www.mit.edu/~amini/LICENSE.md).

This analysis is being conducted as a group project for the [Introduction to Data Science](https://ois2.tlu.ee/tluois/subject/ULP6613-23265) course of the <cite>[Artificial Intelligence and Sustainable Societies](https://aissprogram.eu)</cite> master's programme.

In [1]:
# Import libraries
import pandas as pd
import matplotlib.pyplot as plt

In [2]:
# Import and print dataset
data = pd.read_csv('./fediverse-client-alt-text-data-2024-05-13.csv')
data

Unnamed: 0,client,status_count,descriptions_all_count,descriptions_all_percent,descriptions_some_count,descriptions_some_percent,descriptions_none_count,descriptions_none_percent
0,Web,8272,1438,17.383946,7,0.084623,6827,82.531431
1,dlvr.it,5806,1,0.017224,0,0.000000,5805,99.982776
2,Mastodon for Android,1894,270,14.255544,3,0.158395,1621,85.586061
3,unknown,1428,366,25.630252,1,0.070028,1061,74.299720
4,AboveMaidstoneBot,1339,0,0.000000,0,0.000000,1339,100.000000
...,...,...,...,...,...,...,...,...
261,socialbot,1,1,100.000000,0,0.000000,0,0.000000
262,PhonocasterMusicShare,1,0,0.000000,0,0.000000,1,100.000000
263,openvibe,1,0,0.000000,0,0.000000,1,100.000000
264,iflaapp,1,0,0.000000,0,0.000000,1,100.000000


## Data wrangling

In [3]:
print(f'Confirm data type of the dataset: {type(data)}')

Confirm data type of the dataset: <class 'pandas.core.frame.DataFrame'>


In [4]:
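# Experimental example taken from ChatGPT
from IPython.display import display  # built into Jupyter; imported explicitly so the cell also runs elsewhere

def summarize_dataframe(df):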
    """
    Provides a summary of each column in the DataFrame, including:
    - Data type
    - Number of missing values
    - Number of unique values
    - Descriptive statistics (for numerical columns)
    
    Parameters:
    df (pd.DataFrame): The DataFrame to summarize.
    """
    # Data type, number of missing values and number of unique values per column
    summary = pd.DataFrame({
        'Data Type': df.dtypes,
        'Missing Values': df.isnull().sum(),
        'Unique Values': df.nunique()
    })
    
    # Descriptive statistics for numerical columns
    numerical_summary = df.describe().T
    summary = summary.merge(numerical_summary, left_index=True, right_index=True, how="left")
    
    # Display summary for categorical columns separately
    categorical_columns = df.select_dtypes(include=['object']).columns
    categorical_summary = df[categorical_columns].describe().T
    
    print("Summary of Numerical Columns:")
    display(summary)
    print("\nSummary of Categorical Columns:")
    display(categorical_summary)

summarize_dataframe(data)

Summary of Numerical Columns:


Unnamed: 0,Data Type,Missing Values,Unique Values,count,mean,std,min,25%,50%,75%,max
client,object,0,266,,,,,,,,
status_count,int64,0,80,266.0,112.240602,643.625189,1.0,2.0,8.0,39.75,8272.0
descriptions_all_count,int64,0,45,266.0,21.473684,105.282555,0.0,0.0,0.0,4.0,1438.0
descriptions_all_percent,float64,0,44,266.0,36.290411,45.079671,0.0,0.0,0.0,100.0,100.0
descriptions_some_count,int64,0,6,266.0,0.093985,0.622382,0.0,0.0,0.0,0.0,7.0
descriptions_some_percent,float64,0,12,266.0,0.060461,0.527118,0.0,0.0,0.0,0.0,7.142857
descriptions_none_count,int64,0,70,266.0,90.672932,568.47704,0.0,0.0,2.0,20.0,6827.0
descriptions_none_percent,float64,0,44,266.0,63.649128,45.113843,0.0,0.0,100.0,100.0,100.0



Summary of Categorical Columns:


Unnamed: 0,count,unique,top,freq
client,266,266,Today's Dérive app task,1


### Information about the dataset

In [5]:
# data.shape is a tuple: (number of rows, number of columns)
data_rows, data_columns = data.shape
# data.size is an integer: number of rows * number of columns
data_size = data.size

print(f'Number of rows: {data_rows}\nNumber of columns: {data_columns}\nData size (rows*columns): {data_size}')

Number of rows: 266
Number of columns: 8
Data size (rows*columns): 2128


In [6]:
data.describe()

Unnamed: 0,status_count,descriptions_all_count,descriptions_all_percent,descriptions_some_count,descriptions_some_percent,descriptions_none_count,descriptions_none_percent
count,266.0,266.0,266.0,266.0,266.0,266.0,266.0
mean,112.240602,21.473684,36.290411,0.093985,0.060461,90.672932,63.649128
std,643.625189,105.282555,45.079671,0.622382,0.527118,568.47704,45.113843
min,1.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,2.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,8.0,0.0,0.0,0.0,0.0,2.0,100.0
75%,39.75,4.0,100.0,0.0,0.0,20.0,100.0
max,8272.0,1438.0,100.0,7.0,7.142857,6827.0,100.0
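
A note on reading the percentage columns: the mean of `descriptions_all_percent` (about 36.29) gives every client the same weight, whether it published one post or several thousand. Dividing the two count means from the table, 21.47 / 112.24, gives roughly 19% of all posts instead. The following is a minimal sketch of that difference, using only the first five rows shown earlier rather than the full dataset; the variable names are ours and purely illustrative.

```python
import pandas as pd

# First five rows of the dataset, copied from the output shown earlier
sample = pd.DataFrame({
    'client': ['Web', 'dlvr.it', 'Mastodon for Android', 'unknown', 'AboveMaidstoneBot'],
    'status_count': [8272, 5806, 1894, 1428, 1339],
    'descriptions_all_count': [1438, 1, 270, 366, 0],
})

# Per-client percentage, then averaged: every client weighs the same
per_client_percent = 100 * sample['descriptions_all_count'] / sample['status_count']
unweighted_mean = per_client_percent.mean()

# Pooled percentage: every post weighs the same
pooled_percent = (
    100 * sample['descriptions_all_count'].sum() / sample['status_count'].sum()
)

print(f'Mean of per-client percentages: {unweighted_mean:.2f}%')
print(f'Pooled percentage over all posts: {pooled_percent:.2f}%')
```

Which of the two figures is more appropriate depends on whether the question is about clients or about posts.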


In [7]:
# Check if any client name appears more than once
any(data.duplicated(subset='client'))

False
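
`DataFrame.duplicated()` called without arguments flags only rows that are identical in every column, whereas `subset='client'` restricts the comparison to the client name. The difference can be sketched on made-up rows (the values below are invented for illustration and are not from the dataset):

```python
import pandas as pd

# Made-up rows: the client name "Web" appears twice, but with different counts
toy = pd.DataFrame({
    'client': ['Web', 'Web', 'dlvr.it'],
    'status_count': [10, 12, 5],
})

# Whole-row comparison: no two rows are identical in every column
full_row_duplicates = any(toy.duplicated())

# Comparison restricted to the client column: the second "Web" is flagged
client_duplicates = any(toy.duplicated(subset='client'))

print(full_row_duplicates, client_duplicates)
```

In the real dataset the `client` column has 266 unique values over 266 rows, so both checks return `False`.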

In [8]:
# Check if there are any missing values in status_count
any(data.status_count.isnull())

False

In [9]:
# Sort clients by status_count
data.sort_values('status_count', ascending=False)

Unnamed: 0,client,status_count,descriptions_all_count,descriptions_all_percent,descriptions_some_count,descriptions_some_percent,descriptions_none_count,descriptions_none_percent
0,Web,8272,1438,17.383946,7,0.084623,6827,82.531431
1,dlvr.it,5806,1,0.017224,0,0.000000,5805,99.982776
2,Mastodon for Android,1894,270,14.255544,3,0.158395,1621,85.586061
3,unknown,1428,366,25.630252,1,0.070028,1061,74.299720
4,AboveMaidstoneBot,1339,0,0.000000,0,0.000000,1339,100.000000
...,...,...,...,...,...,...,...,...
261,socialbot,1,1,100.000000,0,0.000000,0,0.000000
262,PhonocasterMusicShare,1,0,0.000000,0,0.000000,1,100.000000
263,openvibe,1,0,0.000000,0,0.000000,1,100.000000
264,iflaapp,1,0,0.000000,0,0.000000,1,100.000000


### Naming and understanding data fields

All columns contain relevant data. We go through each column, one by one, to understand its meaning and give it a more descriptive name.

In [10]:
data = data.rename(columns={
    'status_count': 'posts_count',
    'descriptions_all_count': 'alt-text_all_count',
    'descriptions_all_percent': 'alt-text_all_percent',
    'descriptions_some_count': 'alt-text_some_count',
    'descriptions_some_percent': 'alt-text_some_percent',
    'descriptions_none_count': 'alt-text_none_count',
    'descriptions_none_percent': 'alt-text_none_percent'
})
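
Our reading of the columns, which is an assumption based on their names and values, is that each post falls into exactly one of three categories: all of its media have alt-text, only some do, or none do. If so, the three counts should add up to `posts_count` for every client. A minimal sketch of that check, using the first three rows of the dataset with the renamed columns:

```python
import pandas as pd

# First three rows of the dataset, with the renamed columns
sample = pd.DataFrame({
    'client': ['Web', 'dlvr.it', 'Mastodon for Android'],
    'posts_count': [8272, 5806, 1894],
    'alt-text_all_count': [1438, 1, 270],
    'alt-text_some_count': [7, 0, 3],
    'alt-text_none_count': [6827, 5805, 1621],
})

count_columns = ['alt-text_all_count', 'alt-text_some_count', 'alt-text_none_count']

# For each client, the three categories should add up to the number of posts
counts_add_up = sample[count_columns].sum(axis=1) == sample['posts_count']
print(counts_add_up.all())
```

Because the new names contain a hyphen, these columns can only be accessed with brackets, as in `data['alt-text_all_count']`; attribute access such as `data.alt-text_all_count` would be parsed by Python as a subtraction.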

## Data exploration

In [11]:
data.head(5)

Unnamed: 0,client,posts_count,alt-text_all_count,alt-text_all_percent,alt-text_some_count,alt-text_some_percent,alt-text_none_count,alt-text_none_percent
0,Web,8272,1438,17.383946,7,0.084623,6827,82.531431
1,dlvr.it,5806,1,0.017224,0,0.0,5805,99.982776
2,Mastodon for Android,1894,270,14.255544,3,0.158395,1621,85.586061
3,unknown,1428,366,25.630252,1,0.070028,1061,74.29972
4,AboveMaidstoneBot,1339,0,0.0,0,0.0,1339,100.0


In [12]:
data.tail(5)

Unnamed: 0,client,posts_count,alt-text_all_count,alt-text_all_percent,alt-text_some_count,alt-text_some_percent,alt-text_none_count,alt-text_none_percent
261,socialbot,1,1,100.0,0,0.0,0,0.0
262,PhonocasterMusicShare,1,0,0.0,0,0.0,1,100.0
263,openvibe,1,0,0.0,0,0.0,1,100.0
264,iflaapp,1,0,0.0,0,0.0,1,100.0
265,Today's Dérive app task,1,1,100.0,0,0.0,0,0.0
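
The tail contains clients with a single post, for which `alt-text_all_percent` can only be 0 or 100. When comparing clients by percentage, it may therefore help to keep only those above a minimum number of posts. The sketch below uses a few rows copied from the head and tail outputs; `MIN_POSTS` is a hypothetical threshold chosen only for illustration.

```python
import pandas as pd

# Rows copied from the head and tail outputs
sample = pd.DataFrame({
    'client': ['Web', 'Mastodon for Android', 'socialbot', 'openvibe'],
    'posts_count': [8272, 1894, 1, 1],
    'alt-text_all_percent': [17.383946, 14.255544, 100.0, 0.0],
})

# Hypothetical threshold, chosen only for illustration
MIN_POSTS = 100

# Keep clients with enough posts, then rank them by share of fully described posts
ranked = (
    sample[sample['posts_count'] >= MIN_POSTS]
    .sort_values('alt-text_all_percent', ascending=False)
)
print(ranked['client'].tolist())
```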
