# Fundamentals of Social Data Science

## Week 2 Day 2 Lab: Downloading from Wikipedia

Today we will review some changes to the Wikipedia code. These changes considerably expand what you can do with it. The end result will be two folders, `data` and `DataFrames`, which you can use for analysis of Wikipedia.

I have altered the code in several ways:

- It now uses curl against Special:Export to get the complete history of a page, and reports on the download as it runs.
- Reporting and commenting are considerably expanded.
- New arguments are available to the script, including `--count_only`.
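
To make the first change more concrete, here is a minimal sketch of what a Special:Export request can look like from Python. It is not the code in `download_wiki_revisions.py`. The function name `build_export_request` is hypothetical, and the form fields (`pages`, `history`, `action`) are assumptions based on the MediaWiki documentation for Special:Export. The sketch only builds the request and never sends it.

```python
import urllib.parse
import urllib.request

# Assumption: the English Wikipedia Special:Export endpoint.
EXPORT_URL = "https://en.wikipedia.org/w/index.php?title=Special:Export"


def build_export_request(title):
    """Build (but do not send) a POST request for a page's history.

    Hypothetical helper for illustration only; the real script may
    construct its curl call differently.
    """
    form = {
        "pages": title,      # page title, e.g. "Data_science"
        "history": "1",      # ask for the full history, not just the latest revision
        "action": "submit",
    }
    body = urllib.parse.urlencode(form).encode("utf-8")
    return urllib.request.Request(EXPORT_URL, data=body, method="POST")


request = build_export_request("Data_science")
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```

Building the request separately from sending it makes it easy to inspect exactly what would be posted before you download anything.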

There is also now a second script, `xml_to_dataframe.py`, which processes these files and turns them into a separate DataFrame for each article. The DataFrames are stored as `.feather` files and can be loaded with the code below.

You should review `xml_to_dataframe.py`. All the operations in that file have been covered in class, with the exception of `tqdm`, and you can see from the script how that works in practice.
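
As a warm-up before reading the script, the sketch below shows one way a single `<revision>` element could be turned into a row with the same column names that appear in the DataFrame output further down (`revision_id`, `timestamp`, `username`, `userid`, `comment`, `text_length`). It is an illustration under assumptions, not the contents of `xml_to_dataframe.py`: the sample XML is made up, and `child`, `text_of` and `revision_to_row` are hypothetical helper names.

```python
import xml.etree.ElementTree as ET

# Made-up sample revision, shaped like a MediaWiki export record.
SAMPLE_REVISION = """<revision>
  <id>123</id>
  <timestamp>2012-04-11T17:34:10Z</timestamp>
  <contributor><username>ExampleUser</username><id>42</id></contributor>
  <comment>Created page</comment>
  <text>Data science is a field.</text>
</revision>"""


def child(parent, name):
    """Return the first direct child called `name`, ignoring any XML namespace."""
    if parent is None:
        return None
    for element in parent:
        if element.tag.rsplit("}", 1)[-1] == name:
            return element
    return None


def text_of(element):
    """Return the element's text, or None if the element is missing."""
    return element.text if element is not None else None


def revision_to_row(xml_string):
    """Convert one <revision> element into a flat dictionary."""
    revision = ET.fromstring(xml_string)
    contributor = child(revision, "contributor")
    body = text_of(child(revision, "text")) or ""
    return {
        "revision_id": text_of(child(revision, "id")),
        "timestamp": text_of(child(revision, "timestamp")),
        "username": text_of(child(contributor, "username")),  # None if absent
        "userid": text_of(child(contributor, "id")),
        "comment": text_of(child(revision, "comment")),
        "text_length": len(body),
    }


row = revision_to_row(SAMPLE_REVISION)
print(row)
```

Missing fields come back as `None`, which is one plausible reason the `username` and `comment` columns can have fewer non-null values than there are revisions.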

You will note that this version does not use recursion to count the files. Instead it looks more literally within each year and month folder. This is sufficient for this work, but it would not be robust with a deeper folder structure, or with one whose layout is less certain. On the other hand, assuming year and month allows some interesting statistics about each year and month to be displayed. In your own work you may now consider whether to approach a task with a more general but often more abstract solution, or a more specific but often more fragile one. You can see in Jon's solution (`download_and_count_revisions_solution.py`) that he used a clever way to count all the files using a global, letting the global carry the count through the recursion.
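
The trade-off can be seen in a small sketch. It assumes a layout of `<article>/<year>/<month>/<revision>.xml`, which is my reading of the description above rather than something copied from the script. The function names are hypothetical, and the general version uses `Path.rglob` rather than the global-variable approach in Jon's solution.

```python
import tempfile
from pathlib import Path


def count_by_year_month(article_dir):
    """Specific approach: only looks in <year>/<month>/ and counts .xml files there."""
    counts = {}
    for year_dir in sorted(Path(article_dir).iterdir()):
        if not year_dir.is_dir():
            continue
        for month_dir in sorted(year_dir.iterdir()):
            if not month_dir.is_dir():
                continue
            n_files = sum(1 for f in month_dir.iterdir() if f.suffix == ".xml")
            counts[(year_dir.name, month_dir.name)] = n_files
    return counts


def count_recursive(root):
    """General approach: counts .xml files at any depth under root."""
    return sum(1 for p in Path(root).rglob("*.xml") if p.is_file())


# Build a tiny throwaway tree to compare the two approaches.
with tempfile.TemporaryDirectory() as tmp:
    article = Path(tmp) / "Example_article"
    layout = {
        "2012/04": ["a.xml", "b.xml"],
        "2012/05": ["c.xml"],
        "2013/01": ["d.xml"],
        "2013/01/nested": ["e.xml"],  # deeper than the assumed structure
    }
    for folder, names in layout.items():
        target = article / folder
        target.mkdir(parents=True, exist_ok=True)
        for name in names:
            (target / name).write_text("<revision/>")

    by_month = count_by_year_month(article)
    specific_total = sum(by_month.values())
    general_total = count_recursive(article)

print("Per year and month:", by_month)
print("Specific total:", specific_total)
print("General total:", general_total)
```

The specific version gives you a per-month breakdown for free but silently misses the file in the unexpected `nested` folder. The general version finds every file but tells you nothing about years or months.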

You should now be able to download the complete history of a single Wikipedia page and process it as a DataFrame. Confirm that you can do this with the code yourself. Then discuss among your group:

1. Which two (or more) public figures are worth comparing and why.
2. Prior to any specific time series analysis, consider your expectations for this exploratory comparison.

Draw upon your group's potential expertise in social science to come up with a theoretically informed rationale for a given comparison.


## Merging in Changes to a Repository

First you will want to merge files from an upstream branch (mine). These instructions show how to do that from the terminal. You will want to be in the `oii-fsds-wikipedia` folder when entering these commands. Note especially **Step 3**: the merge in Step 4 can overwrite `download_wiki_revisions.py`, so consider making a backup of your own version first.

1. **Add the original repository as a remote:**

   ```sh
   git remote add upstream https://github.com/berniehogan/oii-fsds-wikipedia.git
   ```

2. **Fetch the changes from the original repository:**

   ```sh
   git fetch upstream
   ```

3. **Back up any local changes:**
   If you have your own versions of files like `download_wiki_revisions.py`, rename them first to avoid conflicts:

   ```sh
   mv download_wiki_revisions.py download_wiki_revisions_backup.py
   ```

4. **Merge upstream changes into your local main branch:**

   ```sh
   git merge upstream/main
   ```

5. **Resolve any conflicts and commit the changes:**
   Resolve any conflicts that arise during the merge, then stage and commit the result:

   ```sh
   git add .
   git commit -m "Merge changes from upstream"
   ```

6. **Push the changes to your GitHub repository:**

   ```sh
   git push origin main
   ```

7. **Test your code after merging:**
   Run your code to check that everything still works after the merge.

By following these steps, you should be able to integrate the latest changes from my repository while preserving your own custom modifications.


Once this is done, you can use the script below, if you wish, to run the commands directly within a Jupyter notebook rather than via the terminal.


```python
import os
import pandas as pd

# Define the articles we want to download
article1 = "Data_science"
article2 = "Machine_learning"

# Create the necessary directories if they don't exist
os.makedirs("data", exist_ok=True)
os.makedirs("DataFrames", exist_ok=True)

# Download revisions for both articles
print("Downloading revisions for first article...")
os.system(f'python download_wiki_revisions.py "{article1}"')
print("\nDownloading revisions for second article...")
os.system(f'python download_wiki_revisions.py "{article2}"')

# Convert all downloaded revisions to DataFrames
print("\nConverting revisions to DataFrames...")
os.system('python xml_to_dataframe.py --data-dir ./data --output-dir ./DataFrames')

# Load and verify one of the DataFrames
print("\nVerifying DataFrame contents...")
df = pd.read_feather(f"DataFrames/{article1}.feather")

# Display basic information about the DataFrame
# (df.info() prints directly and returns None, so it is not wrapped in print)
print("\nDataFrame Info:")
df.info()

print("\nFirst few rows:")
print(df.head())

# Display some basic statistics
print("\nBasic statistics:")
print(f"Total number of revisions: {len(df)}")
print(f"Date range: from {df['timestamp'].min()} to {df['timestamp'].max()}")
print(f"Number of unique editors: {df['username'].nunique()}")
```

```
Downloading revisions for first article...
Downloading complete history of Data_science

Downloading revisions: 35.3MiB [00:01, 23.8MiB/s]

Found 1709 revisions. Organizing into directory structure...

100%|██████████| 1709/1709 [00:04<00:00, 409.03it/s]

Final revision counts:
Found 1709 total revisions for 'Data_science'.

Breakdown by year:
  2012: 91 revisions
  2013: 127 revisions
  2014: 73 revisions
  2015: 143 revisions
  2016: 103 revisions
  2017: 135 revisions
  2018: 190 revisions
  2019: 130 revisions
  2020: 168 revisions
  2021: 133 revisions
  2022: 185 revisions
  2023: 110 revisions
  2024: 121 revisions

Downloading revisions for second article...
Downloading complete history of Machine_learning

Downloading revisions: 238MiB [00:16, 14.4MiB/s]

Found 3887 revisions. Organizing into directory structure...

100%|██████████| 3887/3887 [00:18<00:00, 214.17it/s]

Final revision counts:
Found 3887 total revisions for 'Machine_learning'.

Breakdown by year:
  2003: 6 revisions
  2004: 33 revisions
  2005: 103 revisions
  2006: 138 revisions
  2007: 130 revisions
  2008: 71 revisions
  2009: 74 revisions
  2010: 132 revisions
  2011: 129 revisions
  2012: 113 revisions
  2013: 96 revisions
  2014: 152 revisions
  2015: 219 revisions
  2016: 261 revisions
  2017: 263 revisions
  2018: 270 revisions
  2019: 293 revisions
  2020: 297 revisions
  2021: 244 revisions
  2022: 298 revisions
  2023: 328 revisions
  2024: 237 revisions

Converting revisions to DataFrames...
Processing with text length only

Processing Machine_learning: 100%|██████████| 4/4 [00:05<00:00,  1.44s/batch]
Processing Data_science:   0%|          | 0/2 [00:00<?, ?batch/s]

Summary for Machine_learning:
Total revisions: 3887
Date range: 2003-05-25 06:03:17+00:00 to 2024-10-21 15:03:51+00:00
Unique contributors: 1098
Average text length: 59622.2 characters

Processing Data_science:  50%|█████     | 1/2 [00:01<00:01,  1.33s/batch]

Summary for Data_science:
Total revisions: 1709
Date range: 2012-04-11 17:34:10+00:00 to 2024-09-04 22:32:11+00:00
Unique contributors: 466
Average text length: 19660.1 characters

Summary for Tokamak:
Total revisions: 5
Date range: 2024-10-19 11:58:26+00:00 to 2024-10-21 09:11:35+00:00
Unique contributors: 4
Average text length: 114381.8 characters

Verifying DataFrame contents...

DataFrame Info:
<class 'pandas.core.frame.DataFrame'>
Index: 1709 entries, 479 to 759
Data columns (total 8 columns):
 #   Column       Non-Null Count  Dtype              
---  ------       --------------  -----              
 0   revision_id  1709 non-null   object             
 1   timestamp    1709 non-null   datetime64[ns, UTC]
 2   username     1123 non-null   object             
 3   userid       1123 non-null   object             
 4   comment      1372 non-null   object             
 5   text_length  1709 non-null   int64              
 6   year         1709 non-null   object             
 7   mont

Processing Data_science: 100%|██████████| 2/2 [00:02<00:00,  1.18s/batch]
Processing Tokamak: 100%|██████████| 1/1 [00:00<00:00, 71.22batch/s]
```

```python
# Load the second article's DataFrame
data = pd.read_feather(f"DataFrames/{article2}.feather")
```
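
With both DataFrames loaded, a first exploratory comparison might count revisions per year for each article and place the counts side by side. The sketch below uses two tiny made-up DataFrames so that it runs on its own. The timestamps are illustrative, not real revision data, and `revisions_per_year` is a hypothetical helper. It assumes only that each DataFrame has a timezone-aware `timestamp` column, as in the `df.info()` output above.

```python
import pandas as pd


def revisions_per_year(frame):
    """Count revisions per calendar year from a datetime `timestamp` column."""
    return frame["timestamp"].dt.year.value_counts().sort_index()


# Toy stand-ins for the two article DataFrames (made-up timestamps).
article_a = pd.DataFrame({
    "timestamp": pd.to_datetime(
        ["2012-04-11T17:34:10Z", "2012-06-01T00:00:00Z", "2013-01-05T12:00:00Z"],
        utc=True,
    )
})
article_b = pd.DataFrame({
    "timestamp": pd.to_datetime(
        ["2003-05-25T06:03:17Z", "2012-02-02T08:00:00Z"],
        utc=True,
    )
})

# Align the two yearly counts on a shared index; years with no edits become 0.
comparison = (
    pd.concat(
        {
            "article_a": revisions_per_year(article_a),
            "article_b": revisions_per_year(article_b),
        },
        axis=1,
    )
    .fillna(0)
    .astype(int)
    .sort_index()
)
print(comparison)
```

To use this with the real data, replace the toy DataFrames with the ones returned by `pd.read_feather`. Writing down what you expect the table to show before you run it is a good way to capture your group's expectations for the comparison.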