# Analysis of Content Evolution on Wikipedia
**By: Tanner Martz**

[Link to our GitHub webpage](https://wikipediacontentanalysis.github.io)

## Project Goals
Our primary objective is to examine the evolution and growth of content on Wikipedia. We aim to understand:
1. How article length and references have changed over time.
2. The growth rate of multimedia (images, videos) within articles.
3. User collaboration patterns in editing Wikipedia articles.

## Project Dataset
We will use Wikipedia's [Revision History dataset](https://datasets.wikimedia.org/revision-history/). This dataset provides a comprehensive log of edits made to Wikipedia articles. From it, we can draw insights about content evolution, user collaboration, and the overall growth patterns of Wikipedia articles.

Additionally, to understand multimedia growth, we will explore Wikipedia's [Multimedia dataset](https://datasets.wikimedia.org/multimedia/). This will allow us to quantify the integration and relevance of multimedia content in Wikipedia articles over the years.

## Collaboration Plan
Our team will meet bi-weekly via Zoom to review findings, agree on next steps, and address any challenges. We will use GitHub for version control and collaboration. Our immediate plan includes:
- Weeks 1-2: Data extraction and preliminary cleaning.
- Weeks 3-4: Initial analysis and visualization of findings for content growth.

## ETL (Extract, Transform, and Load)
We have started by extracting a subset of the Wikipedia Revision History dataset. The subset includes article titles, timestamps, user IDs, and edit sizes. We intend to transform this data by aggregating edits by month and by year to understand content evolution patterns.
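The month and year aggregation described above could be sketched as follows. This is a minimal illustration on a made-up sample, not our final pipeline: the column names `title`, `timestamp`, `user_id`, and `edit_size` and the helper `consolidate_edits` are assumptions for illustration, since the real schema may name these fields differently.

```python
import pandas as pd

# Tiny invented sample; column names are assumed, not taken from the real dataset.
revisions = pd.DataFrame({
    "title": ["Python", "Python", "Rust", "Rust"],
    "timestamp": ["2020-01-05", "2020-01-20", "2020-02-03", "2021-02-10"],
    "user_id": [1, 2, 1, 3],
    "edit_size": [120, -30, 400, 50],
})

def consolidate_edits(df, by=("year", "month")):
    """Hypothetical helper: summarize edits per calendar period."""
    out = df.copy()
    ts = pd.to_datetime(out["timestamp"])
    out["year"] = ts.dt.year
    out["month"] = ts.dt.month
    return (
        out.groupby(list(by))
        .agg(
            edit_count=("edit_size", "size"),     # number of revisions
            net_size=("edit_size", "sum"),        # net change in size
            unique_editors=("user_id", "nunique"),  # distinct contributors
        )
        .reset_index()
    )

monthly = consolidate_edits(revisions, by=("year", "month"))
yearly = consolidate_edits(revisions, by=("year",))
print(monthly)
print(yearly)
```

Grouping on explicit `year` and `month` columns keeps the same helper usable for both granularities; adding `title` to `by` would give per-article trends instead.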

## Preliminary Dataset View
The following is a preliminary view of our dataset, showing the first few revision records from the extracted subset:

```python
import pandas as pd

# Load the extracted subset of the revision history
wiki_df = pd.read_csv('https://sample.wikipedia.revision.history.csv')
wiki_df.head()  # display the first five rows
```

## Next Steps
Our immediate next steps involve:
1. Data cleaning: Handling missing values and outliers in the dataset.
2. Exploratory analysis: Understanding the distribution and growth pattern of edits.
3. Multimedia analysis: Examining the integration rate of multimedia content in articles.
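
As a rough sketch of the data cleaning step listed above, one option is to drop rows that lack essential fields and flag unusually large edits with the interquartile range (IQR) rule. The column names `title` and `edit_size`, the helper `clean_revisions`, and the 1.5 × IQR threshold are assumptions for illustration; we have not yet settled on a cleaning strategy.

```python
import pandas as pd

# Invented sample with two missing values and one extreme edit size.
raw = pd.DataFrame({
    "title": ["A", "A", None, "B", "B", "B"],
    "edit_size": [100, 120, 90, None, 110, 50000],
})

def clean_revisions(df):
    """Hypothetical helper: drop incomplete rows and flag IQR outliers."""
    # Remove rows where the title or the edit size is missing.
    out = df.dropna(subset=["title", "edit_size"]).copy()

    # Flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
    q1 = out["edit_size"].quantile(0.25)
    q3 = out["edit_size"].quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    out["is_outlier"] = ~out["edit_size"].between(lower, upper)
    return out

cleaned = clean_revisions(raw)
print(cleaned)
```

Outliers are flagged rather than removed here, because very large edits (for example, page creation or reverted vandalism) may be meaningful for the exploratory analysis.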