## Part 1: Getting Started with Pandas and the Stack Overflow Survey Dataset

### 1. Setup and Installation

We'll assume you have Python ≥3.10 (pandas 2.1 through 2.3 need at least Python 3.9, so 3.10 is comfortably supported).
Install via pip:

In [None]:
!pip install --upgrade pandas

Or, if using conda:

```bash
conda install pandas
```

Verify the version so we know what behavior to expect. The current stable series is 2.3.x as of mid-2025. Pandas 2.0 introduced breaking changes from 1.x and removed deprecated APIs, so older tutorials may mention things that no longer exist.

In [None]:
import pandas as pd
print(pd.__version__)  # Expect 2.2.x / 2.3.x; warn if <2.0

**Note:** Pandas 2.0+ enforces deprecations announced in 1.x, so code that relies on removed APIs (e.g., `DataFrame.append`) or on certain implicit dtype coercions may need slight adaptation.
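
The "warn if <2.0" check can also be done in code. The following is a minimal sketch, not part of the original notebook: the helper names `major_version` and `check_pandas_version` are hypothetical, and it reads the installed version from package metadata via the standard library's `importlib.metadata`, so it works even before pandas is imported.

```python
import warnings
from importlib.metadata import PackageNotFoundError, version


def major_version(version_string: str) -> int:
    """Return the leading integer of a dotted version string such as '2.3.1'."""
    return int(version_string.split(".")[0])


def check_pandas_version(minimum_major: int = 2) -> bool:
    """Return True if the installed pandas is at least `minimum_major`.x."""
    try:
        installed = version("pandas")  # reads package metadata, no import needed
    except PackageNotFoundError:
        warnings.warn("pandas is not installed in this environment")
        return False
    if major_version(installed) < minimum_major:
        warnings.warn(
            f"pandas {installed} found; this tutorial assumes {minimum_major}.x"
        )
        return False
    return True


print(check_pandas_version())
```

If this prints `False`, rerunning the install cell above before continuing is the simplest fix.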

---

### 2. Download and prepare the Stack Overflow 2024 Survey data

1. Go to the official Stack Overflow Developer Survey page and download the **2024** full data set (CSV).  
   - The public results file is named `survey_results_public.csv`.  
   - There's a companion schema file (e.g., `survey_results_schema.csv`) which maps column codes to human-readable questions.

2. Unzip the downloaded archive and rename the extracted folder to `data` in your working directory.

3. Confirm that inside `data/` you have at least:
   - `survey_results_public.csv`
   - `survey_results_schema.csv`
   - `README` (explains the files and structure)

*The dataset comes from the official 2024 survey. It has ~65,000 responses; in the public CSV, each row is one respondent and each column is one answer.* ([survey.stackoverflow.co][1], [Kaggle][2])

---

[1]: https://survey.stackoverflow.co/ "Stack Overflow Annual Developer Survey"
[2]: https://www.kaggle.com/code/dima806/stack-overflow-survey-salaries-2023-vs-2024/input "Stack Overflow Survey salaries 2023 vs 2024 - Kaggle"

### 3. CSV vs Excel (Why we’re using CSV here)

- **CSV** is plain text with delimiter-separated values. It's lightweight, universally readable, and ideal for data interchange between systems.  
- **Excel** (`.xlsx`, `.xls`) is a richer format (zipped XML for `.xlsx`, binary for `.xls`) supporting multiple sheets, formatting, formulas, etc., but it requires more complex parsing and an extra reader library (pandas uses `openpyxl` for `.xlsx`).  
- For large-scale programmatic analysis and reproducibility, CSV is preferred because it's simple, version-control friendly, and doesn’t embed presentation metadata.

*References for the distinctions:* ([DataCamp][3], [GeeksforGeeks][4], [Spreadsheet Planet][5])
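
To make the "plain text" point concrete, here is a short sketch that reads CSV from an ordinary string and writes it back out. The two columns and their values are made up for illustration and are not taken from the survey file.

```python
import io

import pandas as pd

# A CSV file is just text: a header line, then one line per record.
csv_text = "Country,YearsCode\nGermany,10\nIndia,4\n"

# read_csv accepts any file-like object, so an in-memory string works too.
df_small = pd.read_csv(io.StringIO(csv_text))
print(df_small.shape)

# Writing back yields the same rows of text (index=False drops the row labels).
round_trip = df_small.to_csv(index=False)
print(round_trip)
```

Because the whole file is readable text, a diff between two versions shows exactly which rows changed, which is what makes CSV version-control friendly.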

---

[3]: https://www.datacamp.com/blog/csv-vs-excel "CSV vs Excel: Making the Right Choice for Your Data Projects"
[4]: https://www.geeksforgeeks.org/excel/difference-between-csv-and-excel/ "Difference Between CSV and Excel - GeeksforGeeks"
[5]: https://spreadsheetplanet.com/csv-vs-xlsx-files/ "CSV vs. XLSX Files - What's the Difference? - Spreadsheet Planet"

### 4. First Pandas Usage

In [None]:
import pandas as pd  # standard alias

In [None]:
df = pd.read_csv('data/survey_results_public.csv')

In [None]:
# Number of rows and columns
print(df.shape)

# Summary of types, non-null counts
df.info()

Sometimes the default display truncates columns/rows. Adjust:

In [None]:
pd.set_option('display.max_columns', 114)  # show more columns horizontally
pd.set_option('display.max_rows', 80)      # show more rows if needed
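
`pd.set_option` changes the setting for the rest of the session. If you only want wider output for a single cell, one alternative (not used in the original notebook) is `pd.option_context`, which restores the previous values when the `with` block ends. A minimal sketch:

```python
import pandas as pd

# Remember the current setting so we can confirm it is restored afterwards.
default_cols = pd.get_option("display.max_columns")

with pd.option_context("display.max_columns", 114, "display.max_rows", 80):
    # Inside the block the temporary values are active.
    inside = pd.get_option("display.max_columns")

# Outside the block the earlier value is back.
after = pd.get_option("display.max_columns")
print(inside, after == default_cols)
```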

Viewing data slices:

In [None]:
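# In a notebook, only the last expression in a cell is rendered,
# so run these one per cell or wrap each call in display().
df.head()        # first 5 rows
df.head(10)      # first 10 rows
df.tail()        # last 5 rows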
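
To see what these calls return without loading the full survey, here is a sketch on a 20-row toy DataFrame. The values are invented for illustration; only the method calls mirror the cell above.

```python
import pandas as pd

# Toy stand-in for the survey data: 20 rows, two columns.
toy = pd.DataFrame({"ResponseId": range(1, 21), "YearsCode": range(20)})

first_five = toy.head()    # default n=5
first_ten = toy.head(10)   # explicit n
last_five = toy.tail()     # default n=5, taken from the end

print(first_five["ResponseId"].tolist())  # [1, 2, 3, 4, 5]
print(len(first_ten))                     # 10
print(last_five["ResponseId"].tolist())   # [16, 17, 18, 19, 20]
```

Each call returns a new DataFrame, so the results can be assigned to variables and inspected further.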

---

### Exercise for Part 1

There is an additional .csv file in the data folder: the schema file.

1. Load the schema file into a DataFrame.
2. Answer the following:

   * How many rows and columns does the schema DataFrame have?
   * Inspect the first few rows to understand its structure. What are the key fields/columns it contains, and what do they mean in context of the survey?
   * Are there any missing values in the schema? Identify which column(s), if any, have missing entries and how many.