# Exploratory Analysis
## Core descriptive statistics
- Total number of models
- Number of unique organizations or authors
- Distribution of licenses
    - License coverage (count of models with missing license metadata)
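
The headline counts above could be gathered in one helper. This is a minimal sketch, assuming the model table has the `author` and `license` columns seen in the dataset output; the `core_stats` name and the toy rows are illustrative and not part of the original notebook.

```python
import pandas as pd

def core_stats(df: pd.DataFrame) -> dict:
    """Headline counts for a model table with `author` and `license` columns."""
    missing = int(df["license"].isna().sum())
    return {
        "total_models": len(df),
        "unique_authors": int(df["author"].nunique()),
        "missing_license": missing,
        # Share of models that declare any license at all
        "license_coverage_percent": 100 * (len(df) - missing) / len(df),
    }

# Toy rows for illustration only; not taken from the real dataset.
toy = pd.DataFrame({
    "author": ["a", "a", "b", "c"],
    "license": ["mit", None, "apache-2.0", "mit"],
})
stats = core_stats(toy)
print(stats)
```

The license distribution itself is a `value_counts()` on the `license` column, as computed in section 1.1.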

## Author and organization metrics
- Most prolific authors (count of models per author)
- Authors with the most downloads
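
Both author rankings can come from a single group-by. The sketch below assumes the `id`, `author`, and `downloads` columns seen in the dataset output; `author_rankings` and the toy rows are hypothetical names and data used only for illustration.

```python
import pandas as pd

def author_rankings(df: pd.DataFrame, n: int = 10) -> pd.DataFrame:
    """Rank authors by number of models, breaking ties by total downloads."""
    summary = (
        df.groupby("author")
          .agg(model_count=("id", "count"), total_downloads=("downloads", "sum"))
          .sort_values(["model_count", "total_downloads"], ascending=False)
    )
    return summary.head(n)

# Toy rows for illustration only; a missing download count is ignored by the sum.
toy = pd.DataFrame({
    "id": ["a/m1", "a/m2", "b/m1"],
    "author": ["a", "a", "b"],
    "downloads": [5.0, None, 100.0],
})
ranked = author_rankings(toy)
most_downloaded = ranked["total_downloads"].idxmax()
print(ranked)
print(most_downloaded)
```

Sorting the same summary by `total_downloads` instead gives the download-based ranking.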

## Model metrics
- Models with the most children
- Models with the most downloads


## Datasets
- Most-used datasets
- Distribution of dataset licenses

## License relationships
- License propagation
    - Does the license of children typically match that of their base model?
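
One way to answer the propagation question is to join each derived model to its base and compare licenses. This sketch assumes a hypothetical `base_model` column holding the parent's `id`; no such column appears in the dataset output, so the name is an assumption. Pairs with a missing license on either side are left out of the rate.

```python
import pandas as pd

def license_match_rate(df: pd.DataFrame) -> float:
    """Share of derived models whose license equals their base model's license.

    Assumes a hypothetical `base_model` column containing the parent's `id`.
    """
    parents = df[["id", "license"]].rename(
        columns={"id": "base_model", "license": "base_license"}
    )
    # Keep only derived models, then attach the parent's license to each
    pairs = df.dropna(subset=["base_model"]).merge(parents, on="base_model", how="inner")
    pairs = pairs.dropna(subset=["license", "base_license"])
    if pairs.empty:
        return float("nan")
    return float((pairs["license"] == pairs["base_license"]).mean())

# Toy rows for illustration only.
toy = pd.DataFrame({
    "id": ["org/base", "u/c1", "u/c2", "u/c3"],
    "license": ["apache-2.0", "apache-2.0", "mit", None],
    "base_model": [None, "org/base", "org/base", "org/base"],
})
rate = license_match_rate(toy)
print(rate)
```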

## Genealogy & derivation insights
- Average number of derivations per model
- Most influential models (top N by descendant count)
- Model lineage length: longest chain of derivation (depth)
- Proportion of derived vs. original models
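
The genealogy metrics above can be sketched over a list of (child, base) pairs. The code assumes each model has at most one base and that the derivation graph has no cycles; how those pairs are extracted from the metadata is not shown in this notebook, so `genealogy_stats` and the toy lineage are illustrative only.

```python
from collections import defaultdict

def genealogy_stats(all_ids, edges):
    """Summarize a derivation forest.

    all_ids: every model id in the dataset.
    edges:   (child_id, base_id) pairs, at most one base per child, no cycles.
    """
    children = defaultdict(list)
    parent = {}
    for child, base in edges:
        children[base].append(child)
        parent[child] = base

    def descendants(node):
        # Count models derived from `node`, directly or through intermediates.
        return sum(1 + descendants(c) for c in children.get(node, []))

    def depth(node):
        # Number of derivation steps from `node` back to its root model.
        steps = 0
        while node in parent:
            node = parent[node]
            steps += 1
        return steps

    return {
        "descendant_counts": {m: descendants(m) for m in all_ids},
        "longest_chain": max((depth(m) for m in all_ids), default=0),
        "derived_share": len(parent) / len(all_ids),
    }

# Toy lineage for illustration: a -> b -> c, a -> d, and e stands alone.
toy_ids = ["a", "b", "c", "d", "e"]
toy_edges = [("b", "a"), ("c", "b"), ("d", "a")]
stats = genealogy_stats(toy_ids, toy_edges)
top_two = sorted(stats["descendant_counts"].items(), key=lambda kv: kv[1], reverse=True)[:2]
print(top_two, stats["longest_chain"], stats["derived_share"])
```

With one base per child, the average number of direct derivations per model equals the number of pairs divided by the number of models, which is the same value as `derived_share`.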

## Temporal patterns
- Model releases over time
    - by month or year (time series)
- Trends by license
    - Are permissive or restrictive licenses increasing?
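
Both temporal views can be derived from the `date_released` column. The sketch below assumes ISO 8601 timestamps like those in the dataset output; the function names and toy rows are hypothetical. Classifying licenses as permissive or restrictive would need a mapping that this notebook does not define, so the example stops at per-license shares.

```python
import pandas as pd

def releases_per_month(df: pd.DataFrame) -> pd.Series:
    """Number of models released in each calendar month."""
    released = pd.to_datetime(df["date_released"], utc=True, errors="coerce").dropna()
    # Drop the timezone before converting to monthly periods
    months = released.dt.tz_localize(None).dt.to_period("M")
    return months.value_counts().sort_index()

def license_share_by_year(df: pd.DataFrame) -> pd.DataFrame:
    """Per-year share of each license; each row sums to 1."""
    released = pd.to_datetime(df["date_released"], utc=True, errors="coerce")
    year = released.dt.year.rename("year")
    return pd.crosstab(year, df["license"], normalize="index")

# Toy rows for illustration only.
toy = pd.DataFrame({
    "date_released": [
        "2023-04-22T14:49:47+00:00",
        "2023-04-06T06:23:19+00:00",
        "2024-02-25T22:53:55+00:00",
    ],
    "license": ["other", "mit", "other"],
})
monthly = releases_per_month(toy)
shares = license_share_by_year(toy)
print(monthly)
print(shares)
```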


## Tags, tasks, and categories
- Most common task types
    - Tags such as `text-generation` and `image-classification`
- Co-occurrence of tasks and licenses
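
A contingency table is one simple way to look at task and license co-occurrence. This sketch assumes the `model_type` column carries the task tag, as the sample rows in the dataset output suggest; `task_license_table` and the placeholder labels for missing values are illustrative choices.

```python
import pandas as pd

def task_license_table(df: pd.DataFrame, top_n: int = 10) -> pd.DataFrame:
    """Counts of models per (task, license) pair for the most frequent values."""
    tasks = df["model_type"].fillna("(no task)")
    licenses = df["license"].fillna("(no license)")
    table = pd.crosstab(tasks, licenses)
    # Restrict to the most frequent tasks and licenses to keep the table readable
    top_tasks = tasks.value_counts().head(top_n).index
    top_licenses = licenses.value_counts().head(top_n).index
    return table.loc[top_tasks, top_licenses]

# Toy rows for illustration only.
toy = pd.DataFrame({
    "model_type": ["text-generation", "text-generation", None, "text-to-image"],
    "license": ["other", "mit", "other", None],
})
table = task_license_table(toy)
print(table)
```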


## Further down the road
- Author specialization
    - Which authors dominate particular task categories?
- Co-occurrence
    - Licenses
    - Models/datasets

# Analysis Guiding Questions

| **Category**          | **Example Question**                                                 |
|------------------------|----------------------------------------------------------------------|
| License Environment    | Which licenses dominate?                                              |
| License Environment    | How do licenses propagate through time?                               |
| Authorship Patterns    | Who builds the most models, and how do they collaborate?              |
| Genealogy Influence    | Which models are most reused as bases?                                |
| Temporal Dynamics      | When did major growth periods occur?                                 |
| Topical Trends         | Which task domains are expanding fastest?                            |

# License Environment
## 1.1 Which licenses dominate?

```python
import pandas as pd

# Load the raw model metadata
df = pd.read_csv('model_data/hf_models_open_raw.csv')

# Count occurrences of each license (value_counts skips rows with no license)
license_counts = df['license'].value_counts().reset_index()
license_counts.columns = ['license', 'count']

# Express each count as a percentage of all models, including unlicensed ones
total_models = len(df)
license_counts['proportion_percent'] = 100 * license_counts['count'] / total_models

license_counts
```

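|   | license               | count | proportion_percent |
|---|-----------------------|-------|--------------------|
| 0 | apache-2.0            | 4697  | 23.403089          |
| 1 | mit                   | 1086  | 5.411061           |
| 2 | other                 | 814   | 4.055805           |
| 3 | cc-by-4.0             | 229   | 1.141006           |
| 4 | llama2                | 223   | 1.111111           |
| 5 | cc-by-nc-4.0          | 222   | 1.106129           |
| 6 | llama3                | 222   | 1.106129           |
| 7 | creativeml-openrail-m | 111   | 0.553064           |
| 8 | llama3.1              | 106   | 0.528151           |
| 9 | llama3.2              | 82    | 0.40857            |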


|   | name | id | model_type | license | date_released | date_last_modified | downloads | author |
|---|------|----|------------|---------|---------------|--------------------|-----------|--------|
| 114 | IFT-GEMBA-multilingual-Openchat | surrey-nlp/IFT-GEMBA-multilingual-Openchat | | other | 2025-10-01T12:01:30+00:00 | 2025-10-01T12:14:30+00:00 | 7.0 | surrey-nlp |
| 123 | OpenAI-Clip | qualcomm/OpenAI-Clip | image-classification | other | 2024-02-25T22:53:55+00:00 | 2025-09-30T23:56:05+00:00 | 326.0 | qualcomm |
| 329 | Qwen3-pruned-6L-from-0.6B-int8-ov | OpenVINO/Qwen3-pruned-6L-from-0.6B-int8-ov | | other | 2025-09-16T11:43:43+00:00 | 2025-09-24T14:25:43+00:00 | 27.0 | OpenVINO |
| 632 | qwen-2.5-openr1-random-subset-rlsft-wrongcode | Xkev/qwen-2.5-openr1-random-subset-rlsft-wrong... | text-generation | other | 2025-09-13T07:18:20+00:00 | 2025-09-15T15:10:21+00:00 | 2.0 | Xkev |
| 676 | sdxl-turbo-openvino | Aminfri/sdxl-turbo-openvino | text-to-image | other | 2025-09-13T21:32:49+00:00 | 2025-09-13T21:33:53+00:00 | | Aminfri |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 19697 | oasst-sft-6-llama-30b-xor | OpenAssistant/oasst-sft-6-llama-30b-xor | | other | 2023-04-22T14:49:47+00:00 | 2023-04-27T14:07:15+00:00 | | OpenAssistant |
| 19741 | opt-350m-finetuned-openbookcorpus | nizar-sayad/opt-350m-finetuned-openbookcorpus | text-generation | other | 2023-04-06T06:23:19+00:00 | 2023-04-06T07:03:31+00:00 | 2.0 | nizar-sayad |
| 19746 | TEST | openSNH/TEST | | other | 2023-03-30T11:26:05+00:00 | 2023-03-30T11:26:06+00:00 | | openSNH |
| 19902 | bt-opt-350m | opentensor/bt-opt-350m | text-generation | other | 2022-09-22T01:47:44+00:00 | 2022-12-26T01:21:37+00:00 | 4.0 | opentensor |
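
Every visible row in the output above carries the `other` license, so it was presumably produced by a filter on the `license` column; the cell that generated it is not present in this export. A filter along these lines would give such a view. The helper name and toy rows are hypothetical.

```python
import pandas as pd

def rows_with_license(df: pd.DataFrame, license_name: str = "other") -> pd.DataFrame:
    """Return the rows whose `license` equals `license_name`, keeping the original index."""
    return df[df["license"] == license_name]

# Toy rows for illustration only.
toy = pd.DataFrame({
    "id": ["x/one", "y/two", "z/three"],
    "license": ["other", "mit", "other"],
})
other_rows = rows_with_license(toy)
print(other_rows)
```

Inspecting these rows by hand is one way to see which custom terms hide behind the `other` label.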


## 1.2 How do licenses propagate through time?