# Analyzing Data from the Hugging Face Hub
This mixed-methods project uses model metadata from Hugging Face to examine how 'open' models are described. The goal is to uncover patterns in how the idea of an 'open' model is understood.

## Data Gathering
- Downloaded metadata for models using the Hugging Face Hub API.
- Selected models using full-text search, keeping any model whose name or metadata includes the word 'open'.
- N = 20,069
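
The selection rule above can be read two ways: 'open' as a whole word, or 'open' as a substring of a longer name. A minimal sketch of both readings, applied to records that have already been downloaded. The record fields (`name`, `description`, `tags`), the sample records, and the helper `mentions_open` are hypothetical; this is not the Hub's own search, only a local illustration of the criterion.

```python
import re

# Whole-word match: "open" with no letter or digit directly before or after it.
WORD_OPEN = re.compile(r"(?<![A-Za-z0-9])open(?![A-Za-z0-9])", re.IGNORECASE)

def mentions_open(record, whole_word=False):
    """Return True if 'open' occurs in any text field of a metadata record.

    `record` is a hypothetical dict; string values and lists of strings are searched.
    """
    parts = []
    for value in record.values():
        if isinstance(value, str):
            parts.append(value)
        elif isinstance(value, (list, tuple)):
            parts.extend(str(v) for v in value)
    text = " ".join(parts)
    if whole_word:
        return WORD_OPEN.search(text) is not None
    # Case-insensitive substring match.
    return "open" in text.lower()

# Made-up records for illustration only.
records = [
    {"name": "example-org/OpenWidget-7b", "tags": ["medical"]},
    {"name": "acme/llm", "description": "An open model released under Apache-2.0"},
    {"name": "acme/closed", "tags": ["proprietary"]},
]

substring_hits = [r["name"] for r in records if mentions_open(r)]
word_hits = [r["name"] for r in records if mentions_open(r, whole_word=True)]
```

The substring reading keeps names such as `OpenWidget`, while the whole-word reading drops them, so the choice changes which models count toward N.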

## TODO
- License vs. no license
    - Try to find license first!
- License count
- N-grams for 'Open' 
- Most vs. least popular (by downloads)
- Pre-OSAID vs. post-OSAID (Open Source AI Definition)
    - How and why is this informative?
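
For the n-gram item, one possible starting point is to count the word n-grams that contain the token 'open'. A minimal sketch using only the standard library; the function name `open_ngrams`, the simple tokenizer, and the sample descriptions are illustrative assumptions, not part of the dataset.

```python
import re
from collections import Counter

def open_ngrams(texts, n=2, keyword="open"):
    """Count word n-grams that include `keyword` as one of their tokens."""
    counts = Counter()
    for text in texts:
        # Lowercase and split into alphanumeric tokens.
        tokens = re.findall(r"[a-z0-9]+", text.lower())
        # Slide a window of n tokens and keep windows containing the keyword.
        for i in range(len(tokens) - n + 1):
            gram = tuple(tokens[i:i + n])
            if keyword in gram:
                counts[gram] += 1
    return counts

# Made-up descriptions for illustration only.
samples = [
    "An open source language model",
    "Fully open weights and open data",
    "Open source fine-tune",
]
bigrams = open_ngrams(samples, n=2)
top = bigrams.most_common(1)
```

Because the match is on whole tokens, fused names such as `openmed` would not be counted under this tokenizer; a substring test on each token is an alternative if those should be included.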

In [None]:
# Load the data
import pandas as pd

# Read the raw model metadata into a DataFrame
df = pd.read_csv("model_data/hf_models_open_raw.csv")

# df.head(10)


# Number of models per license value (rows with a missing license are excluded by default)
df['license'].value_counts()

df.head()
# df.sort_values(by="downloads")
# df[df["downloads"] > 50000]
# df.describe(include="all")

# Top 20 most downloaded models
# Note: `downloads` counts downloads this month; data is accurate as of 8/1/25
top_downloaded = df.sort_values("downloads", ascending=False).head(20)
top_downloaded


# Count occurrences of each author
author_counts = df['author'].value_counts().reset_index()
author_counts.columns = ['author', 'count']

print(author_counts)

## Raw number of UNIQUE AUTHORS
# Each row holds one author as a single string, so count distinct values directly.
# Iterating over the strings themselves would count distinct characters, not authors.
n_unique_authors = df["author"].nunique()
print("Number of unique authors:", n_unique_authors)


               author  count
0        mradermacher    993
1             shray98    668
2       RichardErkhov    642
3             OpenMed    475
4     open-unlearning    474
...               ...    ...
4637      labicquette      1
4638          miny480      1
4639           GitBag      1
4640       gendisjawi      1
4641       opensource      1

[4642 rows x 2 columns]
Number of unique authors: 4642


In [30]:
# Count distinct models per author; "name" is the column that holds the model identifier
author_model_counts = df.groupby("author")["name"].nunique().reset_index(name="models_published")

# If you want a dict: dict(author_model_counts.values)
print(author_model_counts.sort_values("models_published", ascending=False))

               author  models_published
3336     mradermacher               993
3997          shray98               668
1204    RichardErkhov               642
1034          OpenMed               475
3519  open-unlearning               474
...               ...               ...
1936       batuhanaky                 1
1938           bblain                 1
1939         bdokmeci                 1
1940            bds23                 1
4641           zzzzit                 1

[4642 rows x 2 columns]
