# W200: Project 2 - Exploratory Data Analysis

Welcome to your final project for DATASCI 200! This is your opportunity to apply the `python`, `numpy`, and `pandas` skills you've learned throughout the semester to explore a dataset that genuinely interests you. You'll work in a team to find a dataset, ask interesting questions about it, and present your findings.

---

## 📅 Key Deadlines

* **Week 10:** Propose your dataset in the [Team Sign-up Sheet](https://docs.google.com/spreadsheets/d/1BK1E7BxYBo5eVU2Vq7N6vOaKiWkZ8tRKdSukUWAp4KM/edit?gid=155815194#gid=155815194) by **11:59 PM PST the day before class**.
* **Week 11:** Submit your formal **Project Proposal** to your team's GitHub repository and Gradescope by **11:59 PM PST the day before class**.
* **Week 14 (During Class):** Your team will give a **10-minute presentation**.
* **Week 14 (After Class):** Submit your final **Jupyter Notebook Report** to GitHub and Gradescope by **11:59 PM PST the day after class**.

---

## 📋 Project Breakdown

This project has five main components: data sourcing, team formation and repo setup, the proposal, an in-class presentation, and a final report.

### **Part 0: Dataset Sourcing**
After week 9, find a dataset on the internet that you're interested in. The dataset should be public (no private data sources, even if you work at the owning company). 
* Make sure it has **at least 100 rows and 6 or more interesting columns**. 
* You may combine multiple datasets. 
* Do not use synthetic / simulated datasets (these are especially common on Kaggle).
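
A quick way to check a candidate dataset against the size requirement is to load it and look at its shape. The snippet below is a minimal sketch: the helper name `meets_size_requirement` and the tiny inline CSV are hypothetical stand-ins, and whether a column counts as "interesting" is still a judgment call that code cannot make for you.

```python
import io

import pandas as pd

MIN_ROWS = 100  # from the project requirements
MIN_COLS = 6    # from the project requirements


def meets_size_requirement(df: pd.DataFrame) -> bool:
    """Return True if the frame has at least MIN_ROWS rows and MIN_COLS columns."""
    n_rows, n_cols = df.shape
    return n_rows >= MIN_ROWS and n_cols >= MIN_COLS


# Hypothetical stand-in for a downloaded file; in practice use pd.read_csv("your_file.csv").
sample_csv = io.StringIO("a,b,c,d,e,f\n1,2,3,4,5,6\n7,8,9,10,11,12\n")
sample = pd.read_csv(sample_csv)

print(sample.shape)                    # (rows, columns)
print(meets_size_requirement(sample))  # False here: the sample has only 2 rows
```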

Here are some great places to start your search:
  * [Awesome Public Datasets](https://github.com/caesar0301/awesome-public-datasets)
  * [Data is Plural](http://tinyletter.com/data-is-plural/archive)
  * [Data.gov](http://data.gov/)
  * [Google Dataset Search](https://datasetsearch.research.google.com/)
  * [Hugging Face](https://huggingface.co/datasets)
  * [Kaggle](https://www.kaggle.com/datasets)  
  * [kdnuggets](https://www.kdnuggets.com/datasets/index.html)
  * [Open Data Inception](https://opendatainception.io/)
  * [Open Data Soft](https://data.opendatasoft.com/explore/dataset/open-data-sources%40public/table/?sort=code_en)
  * [Reddit/r/Datasets](https://www.reddit.com/r/datasets)
  * [... and more](https://www.google.com/)

Propose your dataset in the [Team Sign-up Sheet](https://docs.google.com/spreadsheets/d/1BK1E7BxYBo5eVU2Vq7N6vOaKiWkZ8tRKdSukUWAp4KM/edit?gid=155815194#gid=155815194) by **11:59 PM PST the day before week 10's class**.


### **Part 1: Team & GitHub Setup (5% of Grade)**

1.  **Form a Team:** Your team must include a minimum of 2 and a maximum of 4 students, and all teammates should come from the same section. Add your team members to the [Project 2 Team Formation Sign-up Sheet](https://docs.google.com/spreadsheets/d/1BK1E7BxYBo5eVU2Vq7N6vOaKiWkZ8tRKdSukUWAp4KM/edit?gid=155815194#gid=155815194). We recommend making a private Slack channel for yourselves to facilitate communication (do not add the instructors).

2. **Select a Dataset:**  
    * Finalize the selection of a dataset that one of you has sourced.
    * If you don't know if a dataset would be suitable, please email it to the instruction team for feedback before writing your proposal.

3.  **Create a GitHub Repository:**
    * Set up a new repository in the `UC-Berkeley-I-School` organization.
    * Make sure your repository is private.
    * Name it using this format: `200-[SEMESTER][YEAR]-EDA-[TEAMNAME]`. For example: **200-FA25-EDA-Climate-Change**.
    * Invite your teammates as collaborators and the W200 instructors as viewers.
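
If you want to double-check a repository name before creating it, the naming format can be expressed as a small helper. This is only a sketch: the semester codes (`FA`, `SP`, `SU`) and the characters allowed in the team name are assumptions, not official course rules, and both function names are hypothetical.

```python
import re

# Pattern for 200-[SEMESTER][YEAR]-EDA-[TEAMNAME]. The semester codes and the
# allowed team-name characters are assumptions, not official course rules.
REPO_NAME_PATTERN = re.compile(r"^200-(FA|SP|SU)\d{2}-EDA-[A-Za-z0-9]+(-[A-Za-z0-9]+)*$")


def build_repo_name(semester: str, year: int, team_name: str) -> str:
    """Assemble a repository name such as 200-FA25-EDA-Climate-Change."""
    slug = "-".join(team_name.split())  # replace whitespace with hyphens
    return f"200-{semester.upper()}{year % 100:02d}-EDA-{slug}"


def is_valid_repo_name(name: str) -> bool:
    """Return True if the name matches the assumed pattern above."""
    return REPO_NAME_PATTERN.fullmatch(name) is not None


name = build_repo_name("fa", 2025, "Climate Change")
print(name)                      # 200-FA25-EDA-Climate-Change
print(is_valid_repo_name(name))  # True
```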

4.  **Create a `CHANGELOG.md`:**
    * Right away, create a file named `CHANGELOG.md` in your repository.
    * Every time a team member makes a meaningful contribution, add an entry. This log helps track your project's progress and individual contributions.
    * Use the [common changelog](https://common-changelog.org/#243-authors) format like so:
        ```md
        - 2025-09-23 Fixed column interpolation function (Mumin Khan)
        - 2025-09-24 Added join on secondary dataset (Paul Laskowski)
        ```
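
To keep entries consistent across teammates, you could generate each line with a small helper. The function name `changelog_entry` is hypothetical, and the sketch simply mirrors the date, message, and author layout of the example entries above.

```python
from datetime import date


def changelog_entry(message: str, author: str, day: date) -> str:
    """Format one line in the style shown above: '- YYYY-MM-DD message (Author)'."""
    return f"- {day.isoformat()} {message.strip()} ({author.strip()})"


entry = changelog_entry("Fixed column interpolation function", "Mumin Khan", date(2025, 9, 23))
print(entry)  # - 2025-09-23 Fixed column interpolation function (Mumin Khan)

# To append the entry to the file in your repository root (uncomment to use):
# with open("CHANGELOG.md", "a", encoding="utf-8") as f:
#     f.write(entry + "\n")
```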

### **Part 2: The Proposal (10% of Grade)**

This is a 1-2 page document where you outline your project plan.

**Your proposal must include:**
* The names of all team members.
* The full name of and link to your team's GitHub repository.
* Your planned mode of communication and any regular meeting times from now until week 14.
* A description of the primary dataset you plan to analyze, including a link to it.
* A list of the variables (columns) you find most interesting and what you hope to learn from them.
* Information on any supplemental datasets you might use to enrich your analysis.
* A timeline of milestones that you collectively agree to adhere to from now until the submission of the final report.

### **Part 3: The In-Class Presentation (20% of Grade)**

During our final class, your team will present your findings in PPT format for **10 minutes**, followed by a 5-minute Q&A.

**Your presentation should:**
* Clearly state the main questions your analysis sought to answer.
* Walk the audience through your data exploration and cleaning process.
* Explain any assumptions you made.
* Tell a compelling story with your data, using key charts and graphs to guide your narrative.
* Be ready for follow-up questions about how you treated your data. You don't need to show your code on the slides.

### **Part 4: The Final Report (65% of Grade)**

This is the main deliverable: a polished Jupyter Notebook that details your exploratory data analysis. The goal is to tell a clear and compelling story about your findings. There are no required section headers, as each EDA will be different. We've included a sample structure to get you started below.

**Keys to a Great Report:**
* **Tell a Story:** Your analysis should be a written argument. Guide the reader through your process and what the data reveals.
* **Explain Your Figures:** Every single plot or graph should be explained. If a visualization isn't important enough to write about, don't include it.
* **Check your syntax:** Your notebook should be runnable from top to bottom without user intervention or errors. 
* **Document Everything:** Clearly state if you remove data points, transform variables (e.g., taking a logarithm), or have to handle suspicious values. A simple sentence is often enough.
* **Stick to Descriptive Statistics:** Your analysis should focus on summarizing the data you have. Avoid making broad claims about a larger population. **Do not use words like "significant,"** which imply statistical testing we haven't covered.

Your report should not be a comprehensive compilation of everything your team tried. It should be a thoughtful selection of text and code that tells the story you want to tell. A helpful rule of thumb is to pretend a VP of Engineering will be reading your notebook: your code should be present, but so should explanations before and after each code block.

### Submission

All files for project 2 should be uploaded to GitHub.

1. CHANGELOG.md
      * Should be created shortly after the repo is created
      * Is to be updated whenever new work is done
2. Proposal
     * Upload a PDF of your proposal to GitHub and email the link to your section instructor, CC-ing the rest of your team. 1 submission per team. 
3. Presentation
     * Upload a PDF of your presentation to GitHub after class on week 14.
4. Final Report (.ipynb)
     * A single readable notebook of the exploratory data analysis should be uploaded to GitHub and Gradescope. 
5. Additional files
      * If your dataset is <= 100MB, upload it to GitHub
      * All individual / prototyping notebooks should also be uploaded to GitHub
      * You may package reusable functions in .py files uploaded to GitHub
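
Before adding a dataset to the repository, you can check its size on disk. The sketch below reads "100MB" as 100 * 1024 * 1024 bytes, which is an assumption about the intended threshold. The helper name `fits_on_github` is hypothetical, and the temporary file only stands in for a real dataset.

```python
import os
import tempfile

# Assumption: "100MB" is interpreted as 100 * 1024 * 1024 bytes.
MAX_BYTES = 100 * 1024 * 1024


def fits_on_github(path: str, max_bytes: int = MAX_BYTES) -> bool:
    """Return True if the file at `path` is no larger than `max_bytes`."""
    return os.path.getsize(path) <= max_bytes


# Demonstration with a small temporary file standing in for a dataset.
with tempfile.NamedTemporaryFile(suffix=".csv", delete=False) as tmp:
    tmp.write(b"col_a,col_b\n1,2\n")
    demo_path = tmp.name

demo_fits = fits_on_github(demo_path)
print(demo_fits)
os.remove(demo_path)  # clean up the demonstration file
```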

Please ensure that your GitHub is up to date, even if you submit to Gradescope.

---


## Collaboration Tips

Collaborating on a Jupyter Notebook with GitHub can be tricky because notebooks are complex JSON files, making merge conflicts difficult to resolve. The best approach is to communicate frequently and establish a clear workflow.


#### **GitHub Workflow**

The fundamental workflow for team collaboration on GitHub is: **pull, edit, commit, push**.

1.  **Pull Before You Work:** Always run `git pull` before you start making any changes. This fetches the latest version of the project from the repository and merges it into your local copy, so you're not working on an outdated version.
2.  **Communicate Your Work:** Let your teammates know which part of the notebook you are editing. The simplest way to avoid conflicts is to not have two people edit the same notebook at the same time. 
3.  **Commit and Push Often:** Make small, frequent commits with clear messages. Instead of working for hours and making one large push, push your changes after completing a small task. This keeps the project updated and makes it easier to track changes.

#### **Avoiding Merge Conflicts**

Merge conflicts are the biggest challenge. They happen when two people change the same lines in a file. With notebooks, even running a cell can change the underlying JSON, leading to a conflict.

* **Work in Separate Notebooks:** One effective strategy is for each team member to work on a separate notebook for their part of the analysis. At the end, you can combine the finalized code and visualizations into the main report notebook.
* **Clear Output Before Committing:** Notebook output (like plots and tables) adds a lot of data to the JSON file. It’s a good practice to clear all cell outputs (`Cell > All Output > Clear`) before you commit and push your changes. This reduces the file size and the likelihood of conflicts.
* **Update Your `CHANGELOG.md`:** Remember to update the changelog with a summary of your contribution before you push. This keeps everyone informed about the project's evolution.
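
If you prefer to clear outputs with a script rather than through the menu, the notebook's JSON can be edited directly. The following is a minimal sketch that assumes the nbformat 4 layout, where each code cell carries an `outputs` list and an `execution_count`. The function name `clear_outputs` and the file name `report.ipynb` are hypothetical.

```python
import copy
import json


def clear_outputs(notebook: dict) -> dict:
    """Return a copy of a notebook (nbformat 4 JSON) with code-cell outputs removed."""
    cleaned = copy.deepcopy(notebook)
    for cell in cleaned.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    return cleaned


# Tiny in-memory notebook standing in for a real .ipynb file.
demo_nb = {
    "nbformat": 4,
    "nbformat_minor": 5,
    "metadata": {},
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# Title"]},
        {
            "cell_type": "code",
            "metadata": {},
            "execution_count": 3,
            "source": ["1 + 1"],
            "outputs": [
                {"output_type": "execute_result", "execution_count": 3,
                 "data": {"text/plain": ["2"]}, "metadata": {}}
            ],
        },
    ],
}

cleaned_nb = clear_outputs(demo_nb)
print(cleaned_nb["cells"][1]["outputs"])  # []

# For a real file, read and write it with json.load / json.dump:
# with open("report.ipynb", encoding="utf-8") as f:
#     nb = json.load(f)
# with open("report.ipynb", "w", encoding="utf-8") as f:
#     json.dump(clear_outputs(nb), f, indent=1)
```

Because the function works on a copy, the original notebook object keeps its outputs until you explicitly write the cleaned version back to disk.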

---

#### **Using Google Colab with GitHub**

Google Colab provides a great way for teams to work together without worrying about local Python environments.

1.  **Opening a Notebook from GitHub:**
    * Open [Google Colab](https://colab.research.google.com/).
    * Go to `File > Open notebook`.
    * Select the **GitHub** tab.
    * Enter the repository URL (e.g., `https://github.com/UC-Berkeley-I-School/200-FA25-EDA-Climate-Change`) and press Enter.
    * Click on the notebook you want to open.

2.  **Saving a Notebook to GitHub:**
    * After making your changes in Colab, go to `File > Save a copy in GitHub`.
    * You'll be asked to authorize Colab to access your GitHub account.
    * Choose the correct repository and branch from the dropdown menus.
    * Add a clear commit message describing your changes and click **OK**.

This method allows team members to easily access and contribute to the project from any machine with an internet browser.



# Example Project Structure

Below is a sample of how an EDA might be structured. This is not a prescriptive requirement, and your sections should change based on the dataset you've selected.


---

#### 1. Introduction & Guiding Questions

* **Purpose:** Start here to set the stage for your analysis.
* **Content:**
    * Briefly introduce your chosen topic and why it's interesting.
    * State the primary dataset(s) you are using.
    * Clearly list the 2-4 main questions your analysis will explore. These questions will guide the rest of your report.

#### 2. Data Preparation & Cleaning

* **Purpose:** This section covers your data cleaning and sanity checks. As noted under "Document Everything," you must document your decisions, such as removing observations or transforming variables.
* **Content:**
    * **2.1. Loading the Data:** Import your primary and any supplemental datasets. Show the initial `.head()` and `.info()` to give a first look.
    * **2.2. Merging Datasets (if applicable):** If you joined multiple datasets, explain which columns you used to perform the join.
    * **2.3. Data Cleaning:** Detail the steps you took to clean the data. Be explicit.
        * Handling missing values (e.g., "We filled missing 'Age' values with the median age.").
        * Correcting data types (e.g., "The 'Date' column was converted to datetime objects.").
        * Addressing suspicious or outlier values, and justifying whether you kept or removed them.
    * **2.4. Feature Engineering / Transformation:** Explain any new variables you created or transformations you applied (e.g., "We took the logarithm of the 'Price' variable to reduce the skew in its distribution.").
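
The loading, cleaning, and transformation steps above might look like the sketch below. The data is made up for illustration, and the column names (`Age`, `Date`, `Price`) simply echo the example sentences. Your own dataset will need different decisions, each explained in the surrounding text of your notebook.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical data; in practice use pd.read_csv("your_file.csv").
raw_csv = io.StringIO(
    "Age,Date,Price\n"
    "34,2024-01-05,120.0\n"
    ",2024-01-06,80.0\n"
    "52,2024-01-07,15000.0\n"
    "41,2024-01-08,95.5\n"
)
df = pd.read_csv(raw_csv)

# 2.1 First look at the data.
print(df.head())
df.info()

# 2.3 Fill missing ages with the median and record how many rows were affected.
n_missing_age = int(df["Age"].isna().sum())
median_age = df["Age"].median()
df["Age"] = df["Age"].fillna(median_age)
print(f"Filled {n_missing_age} missing 'Age' value(s) with the median ({median_age}).")

# 2.3 Convert the date strings to datetime objects.
df["Date"] = pd.to_datetime(df["Date"])

# 2.4 Log-transform the skewed price; keep the original column for reference.
df["log_price"] = np.log(df["Price"])
print(df.dtypes)
```

Printing the count of affected rows next to each cleaning step gives you the number to quote when you document the decision in prose.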

#### 3. Exploratory Analysis & Findings

* **Purpose:** This is the core of your report, where you build your "written argument" and tell "compelling data stories". Organize this section around the questions from your introduction.
* **Content:** Structure this section by question.
    * **3.1. Analysis for Question 1: [Your First Question]**
        * Use text, code, and visualizations to explore the data relevant to this question.
        * Characterize relationships between variables. For example, show a scatter plot and describe the relationship you observe.
        * **Every single figure must be explained in your text.** Describe what the chart shows and why it's important for answering your question.
    * **3.2. Analysis for Question 2: [Your Second Question]**
        * Repeat the process, using analysis and well-annotated figures to guide the reader through your findings.
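
As one possible starting point for a question-driven subsection, the sketch below summarizes a variable within groups and describes the association between two numeric columns using descriptive statistics only. The data and column names (`city`, `sqft`, `rent`) are hypothetical, and the commented-out plotting lines assume matplotlib is installed.

```python
import pandas as pd

# Hypothetical data for illustration only.
df = pd.DataFrame(
    {
        "city": ["A", "A", "A", "B", "B", "B"],
        "sqft": [600, 850, 1100, 700, 900, 1300],
        "rent": [1500, 1900, 2400, 1200, 1500, 2100],
    }
)

# Summarize one variable within groups.
rent_by_city = df.groupby("city")["rent"].agg(["count", "mean", "median", "min", "max"])
print(rent_by_city)

# Describe the linear association between two numeric variables in this sample.
r = df["sqft"].corr(df["rent"])
print(f"Pearson correlation between sqft and rent: {r:.2f}")

# A scatter plot of the same pair (requires matplotlib):
# ax = df.plot.scatter(x="sqft", y="rent", title="Rent vs. square footage")
# ax.set_xlabel("Square feet")
# ax.set_ylabel("Monthly rent (USD)")
```

Each table or figure produced this way still needs a sentence or two explaining what it shows about the data at hand, without generalizing beyond it.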

#### 4. Assumptions & Limitations

* **Purpose:** The presentation guidelines require you to discuss assumptions made during the analysis. Including them in the report demonstrates critical thinking.
* **Content:**
    * Briefly list any assumptions you made (e.g., "We assumed that the data collected in 2022 is representative of the entire year.").
    * Mention any limitations of your dataset or analysis (e.g., "Our dataset only included information from one city, so findings may not apply elsewhere.").

#### 5. Conclusion

* **Purpose:** Summarize the key takeaways from your analysis.
* **Content:**
    * Briefly recap your main findings as they relate to your initial questions.
    * Conclude your data story. What is the final message you want the reader to take away from your exploration?