# Scraping Wikipedia revisions

The methods and scripts below come from: https://github.com/berniehogan/oii-fsds-wikipedia

## Merging in Changes to a Repository 

First, merge the latest files from the upstream repository (mine) into your copy. The instructions below show how to do this from the terminal; run the commands from inside the `oii-fsds-wikipedia` folder. Note especially **Step 3**: the merge can overwrite your local `download_wiki_revisions.py`, so back the file up before merging.

1. **Add the original repository as a remote:**
   ```sh
   git remote add upstream https://github.com/berniehogan/oii-fsds-wikipedia.git
   ```

2. **Fetch the changes from the original repository:**
   ```sh
   git fetch upstream
   ```

3. **Back up any local changes:**
   If you have your own versions of files such as `download_wiki_revisions.py`, rename them first so the merge does not overwrite your work:
   ```sh
   mv download_wiki_revisions.py download_wiki_revisions_backup.py
   ```

4. **Merge upstream changes into your local main branch:**
   ```sh
   git merge upstream/main
   ```

5. **Resolve any conflicts and commit the changes:**
   If the merge reports conflicts, resolve them, then stage and commit the result. If Git reports no conflicts, it completes the merge on its own and you can skip this step:
   ```sh
   git add .
   git commit -m "Merge changes from upstream"
   ```

6. **Push the changes to your GitHub repository:**
   ```sh
   git push origin main
   ```

7. **Test your code after merging:**
   You should test your code to ensure everything works correctly after the integration.

By following these steps, you should be able to integrate the latest changes from my repository while preserving your own custom modifications.
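
If you would rather take the Step 3 backup from Python (for example, from a notebook cell), the sketch below is one way to do it. It is not part of the original repository: `backup_file` is a hypothetical helper, and unlike the `mv` command it copies the file rather than renaming it, so the original stays in place. The timestamp in the name is an assumption made here to avoid overwriting an earlier backup.

```python
import shutil
import time
from pathlib import Path


def backup_file(path):
    """Copy `path` to a timestamped sibling file and return the new path.

    Hypothetical helper for illustration; not part of the repository.
    """
    source = Path(path)
    stamp = time.strftime("%Y%m%d-%H%M%S")
    # Keep the original extension so the backup is still recognisable,
    # e.g. download_wiki_revisions_backup_<timestamp>.py
    target = source.with_name(f"{source.stem}_backup_{stamp}{source.suffix}")
    shutil.copy2(source, target)  # copy2 also preserves file timestamps
    return target
```

Calling `backup_file("download_wiki_revisions.py")` before `git merge upstream/main` would leave a copy next to the original that you can compare against after the merge.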

Once this is done, you can optionally use the script below to run the download and conversion commands from within a Jupyter notebook rather than from the terminal.

```python
import os
import pandas as pd

# Define articles we want to download
article1 = "Data_science"
article2 = "Machine_learning"

# Create necessary directories if they don't exist
os.makedirs("data", exist_ok=True)
os.makedirs("DataFrames", exist_ok=True)

# Download revisions for both articles
print("Downloading revisions for first article...")
os.system(f'python download_wiki_revisions.py "{article1}"')
print("\nDownloading revisions for second article...")
os.system(f'python download_wiki_revisions.py "{article2}"')

# Convert all downloaded revisions to DataFrames
print("\nConverting revisions to DataFrames...")
os.system('python xml_to_dataframe.py --data-dir ./data --output-dir ./DataFrames')

# Load and verify one of the DataFrames
print("\nVerifying DataFrame contents...")
df = pd.read_feather(f"DataFrames/{article1}.feather")

# Display basic information about the DataFrame
print("\nDataFrame Info:")
df.info()  # info() prints its summary itself and returns None

print("\nFirst few rows:")
print(df.head())

# Display some basic statistics
print("\nBasic statistics:")
print(f"Total number of revisions: {len(df)}")
print(f"Date range: from {df['timestamp'].min()} to {df['timestamp'].max()}")
print(f"Number of unique editors: {df['username'].nunique()}")
```

```text
Downloading revisions for first article...
Downloading complete history of Data_science

Downloading revisions: 35.3MiB [00:01, 21.2MiB/s]

Found 1709 revisions. Organizing into directory structure...

100%|██████████| 1709/1709 [00:05<00:00, 285.91it/s]

Final revision counts:
Found 1709 total revisions for 'Data_science'.

Breakdown by year:
  2012: 91 revisions
  2013: 127 revisions
  2014: 73 revisions
  2015: 143 revisions
  2016: 103 revisions
  2017: 135 revisions
  2018: 190 revisions
  2019: 130 revisions
  2020: 168 revisions
  2021: 133 revisions
  2022: 185 revisions
  2023: 110 revisions
  2024: 121 revisions

Downloading revisions for second article...
Downloading complete history of Machine_learning
```

**See the original repository for how the code can be altered to download not merely revisions, but full Wikipedia pages over time.**