<a href="https://colab.research.google.com/github/Yuweien/Python-Workshop/blob/main/Python_Workshop_for_Beginners_2_25.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# ‧₊˚✩ ₊˚💻⊹♡ ✨ **Python for Absolute Beginners** ✨ ‧₊˚✩ ₊˚💻⊹♡


---

# **Session 2: A Hands-on Workflow**

---

**Instructor:** Yuwei Wang  
**Contact:** wangyw@arizona.edu  

Feel free to reach out if you have questions after the workshop.


---

# 👋 Welcome! 📄➡️📂 Please make a copy of this notebook

🔗 The link is in the Zoom chat.

👤 Please log in to your Google account first.  
Click **“Open in Google Colab”** in the top-right corner of the page.

🚀 Then go to **File → Save a copy in Drive**  
This will create your own editable copy for today’s workshop.

## 📣 Preparation before we start:
1. 📄 Open **your own copy** of this Google Colab notebook.
2. ✨ We’ll use the built-in Gemini in Colab today.  
   You can open it by clicking the blue star icon ✦ at the bottom of the Colab window.

3. ⏳ For now, just watch the demonstration.  
   After each small step, I’ll pause and give you time to try it yourself.

4. 🔒 Please make sure you are working in **your own copy** of the notebook.  
   To keep the live demo and recording clean, please 🚨 **don’t edit the instructor’s version** 🚨.  
   I’ll share a completed copy of the notebook with everyone after the session.

### 🎯 Today’s goals

By the end of this session, you will be able to:

- **Recognize and describe a basic data workflow**, from raw CSV to interpretable results.

- **Understand what a 🐼 pandas DataFrame 🐼ྀི represents** and how it functions as a structured table for analysis.

- **Read and interpret common Python code patterns**, including:
  - loading data
  - cleaning columns
  - merging tables
  - grouping and summarizing data
  - defining and calling simple functions

- **Identify how this workflow could apply to your own project**, especially when:
  - key metadata is stored in separate files
  - different data types require different analysis strategies
  - you need to combine, summarize, or visualize structured data.





---



## 🤔 A classroom study: Does AI-assisted study improve vocabulary retention?

In this small classroom study, I have two separate tables.


### 1️⃣ The main dataset (downloaded from D2L quiz report)

- D2L quiz results (CSV export)
  - Student names
  - Vocabulary quiz questions (MC)
  - One reflection question (WR)
  - Scores for the quiz




### 2️⃣ Group information (stored separately)

- Study condition (Group A, B, or C)
  - Group C: No vocabulary review (baseline)
  - Group B: Reviewed textbook vocabulary before the quiz
  - Group A: Studied vocabulary with AI assistance before the quiz



### 🚩 The challenge

These two tables are stored separately.

If the quiz export does not include group labels, I cannot compare vocabulary retention across conditions.

I need to combine these two tables into one coherent dataset.







---


### 🧐 This problem shows up elsewhere too


It often happens when key metadata is stored in a separate table. For example:

- Participants are assigned to different treatment groups, but they take the same test.
- Survey responses are stored separately from demographic information.
- Text data is stored separately from coding categories or annotations.

**Core challenge**

How do we combine related pieces of information  
so that the dataset actually reflects the structure of the study?


## 📍🗺️ Workflow roadmap

**⬜ Import D2L downloaded data**  
&nbsp;&nbsp;&nbsp;&nbsp;↓  
**⬜ Inspect & clean**  
&nbsp;&nbsp;&nbsp;&nbsp;↓  
**⬜ Add group info (merge)**  
&nbsp;&nbsp;&nbsp;&nbsp;↓  
**⬜ Split by question type**  
&nbsp;&nbsp;&nbsp;&nbsp;├── **⬜ MC** → group stats → bar chart  
&nbsp;&nbsp;&nbsp;&nbsp;└── **⬜ Short answer** → word freq → word cloud



# Step 0. Files and setup

### You are here
- 🔵 **Import D2L downloaded data**
- ⬜ Inspect & clean
- ⬜ Add group info (merge)
- ⬜ Split by question type

In this step, we will:
- Load the raw D2L CSV file
- Take a first look at what the data looks like

👉 Goal: *Understand what we are working with before touching the data.*



Download the dataset here:
[Download D2L fake dataset](https://raw.githubusercontent.com/your-repo/d2l_fake_export.csv)


In [1]:
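import pandas as pd
import random
import numpy as np

# Note: no random seed is set, so the names and scores change on every run
# and will not match the sample output shown after this cell.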

# ---------------------------
# 1. Create fake students with group
# ---------------------------

first_names = ["Alex", "Jordan", "Taylor", "Morgan", "Riley", "Casey",
               "Jamie", "Avery", "Quinn", "Cameron", "Parker", "Drew"]

last_names = ["Smith", "Johnson", "Lee", "Brown", "Garcia",
              "Martinez", "Davis", "Lopez", "Wilson", "Anderson"]

students = []

groups = ["Group A", "Group B", "Group C"]

for i in range(60):
    first = random.choice(first_names)
    last = random.choice(last_names)
    username = f"{first.lower()}.{last.lower()}{i}"

    # assign groups evenly, 20 students each (0-19 -> A, 20-39 -> B, 40-59 -> C)
    group = groups[i // 20]

    students.append((username, first, last, group))

# ---------------------------
# 2. Short answer categories
# ---------------------------

positive_wr = [
    "I think AI can greatly support language learning.",
    "AI tools help me improve grammar and vocabulary.",
    "AI makes feedback faster and more accessible.",
    "I feel confident using AI as a learning aid."
]

neutral_wr = [
    "It depends on how instructors design the assignments.",
    "AI can be useful but needs clear guidelines.",
    "I feel neutral about AI in language class.",
    "It has both benefits and risks."
]

concern_wr = [
    "I worry students may rely too much on AI.",
    "AI reduces real communication practice.",
    "There is a risk of academic dishonesty.",
    "I am concerned about overdependence on AI."
]

# ---------------------------
# 3. Generate dataset with trends
# ---------------------------

rows = []

for username, first, last, group in students:

    # assign WR attitude category by group
    if group == "Group A":
        wr_pool = positive_wr
        mc_prob = 0.8   # higher average score
    elif group == "Group B":
        wr_pool = neutral_wr
        mc_prob = 0.6
    else:
        wr_pool = concern_wr
        mc_prob = 0.4   # lower average score

    for q in range(1, 11):

        if q <= 9:
            q_type = "MC"
            answer = random.choice(["A", "B", "C", "D"])
            score = np.random.choice([1, 0], p=[mc_prob, 1-mc_prob])
        else:
            q_type = "WR"
            answer = random.choice(wr_pool)
            score = None

        rows.append([
            username,
            first,
            last,
            q,
            q_type,
            answer,
            score
        ])

df = pd.DataFrame(rows, columns=[
    "Username",
    "FirstName",
    "LastName",
    "Q #",
    "Q Type",
    "Answer",
    "Score"
])

df.to_csv("d2l_fake_export.csv", index=False)

df.head()


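| | Username | FirstName | LastName | Q # | Q Type | Answer | Score |
|---|---|---|---|---|---|---|---|
| 0 | morgan.davis0 | Morgan | Davis | 1 | MC | D | 0.0 |
| 1 | morgan.davis0 | Morgan | Davis | 2 | MC | C | 1.0 |
| 2 | morgan.davis0 | Morgan | Davis | 3 | MC | C | 0.0 |
| 3 | morgan.davis0 | Morgan | Davis | 4 | MC | B | 1.0 |
| 4 | morgan.davis0 | Morgan | Davis | 5 | MC | C | 1.0 |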


In [2]:
df

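| | Username | FirstName | LastName | Q # | Q Type | Answer | Score |
|---|---|---|---|---|---|---|---|
| 0 | morgan.davis0 | Morgan | Davis | 1 | MC | D | 0.0 |
| 1 | morgan.davis0 | Morgan | Davis | 2 | MC | C | 1.0 |
| 2 | morgan.davis0 | Morgan | Davis | 3 | MC | C | 0.0 |
| 3 | morgan.davis0 | Morgan | Davis | 4 | MC | B | 1.0 |
| 4 | morgan.davis0 | Morgan | Davis | 5 | MC | C | 1.0 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 595 | drew.lee59 | Drew | Lee | 6 | MC | B | 0.0 |
| 596 | drew.lee59 | Drew | Lee | 7 | MC | C | 0.0 |
| 597 | drew.lee59 | Drew | Lee | 8 | MC | D | 1.0 |
| 598 | drew.lee59 | Drew | Lee | 9 | MC | B | 0.0 |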


In [None]:
# code goes here
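
One way this step might look is sketched below. It is only an illustration: to keep the cell runnable on its own, it reads three made-up rows from an in-memory string (`sample_csv`, a hypothetical stand-in) instead of the real file. In your own copy you would pass the file name to `pd.read_csv`, as the comment shows.

```python
import io
import pandas as pd

# Stand-in for d2l_fake_export.csv so this cell runs by itself (made-up rows).
# With the real file you would write:
#     raw = pd.read_csv("d2l_fake_export.csv")
sample_csv = io.StringIO(
    "Username,FirstName,LastName,Q #,Q Type,Answer,Score\n"
    "alex.lee0,Alex,Lee,1,MC,B,1\n"
    "alex.lee0,Alex,Lee,2,MC,D,0\n"
    "alex.lee0,Alex,Lee,10,WR,AI makes feedback faster.,\n"
)
raw = pd.read_csv(sample_csv)

print(raw.shape)             # (number of rows, number of columns)
print(raw.columns.tolist())  # column names exactly as they appear in the file
print(raw.head())            # first few rows
raw.info()                   # data types and non-missing counts per column
```

Looking at `shape`, the column names, and `info()` before changing anything is usually enough to spot surprises such as empty scores on written-response rows.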


# Step 1. Inspect & clean data

### Workflow status
- ✅ Raw D2L CSV
- 🔵 **Inspect & clean**
- ⬜ Add group info (merge)
- ⬜ Split by question type

In this step, we:
- Inspect columns and basic structure
- Clean obvious issues (extra spaces, column names, unnecessary columns)

👉 Goal: *Make the data reliable for later steps.*



In [None]:
# code goes here
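
The cleaning ideas listed above could be sketched as follows. The `messy` table and its stray spaces are invented for illustration, and the new column names (`username`, `q_num`, `q_type`, `score`) are just one possible naming choice, not part of the D2L export.

```python
import pandas as pd

# Made-up rows with the kinds of problems a raw export can have.
messy = pd.DataFrame({
    "Username ": [" alex.lee0", "alex.lee0 ", "riley.brown1"],
    "FirstName": ["Alex", "Alex", "Riley"],
    "Q #": [1, 2, 1],
    "Q Type": ["MC", "MC ", " MC"],
    "Score": [1.0, 0.0, 1.0],
})

clean = messy.copy()  # keep the raw table untouched

# 1. Tidy column names: strip spaces, then switch to short, code-friendly names.
clean.columns = clean.columns.str.strip()
clean = clean.rename(columns={
    "Username": "username",
    "Q #": "q_num",
    "Q Type": "q_type",
    "Score": "score",
})

# 2. Strip stray spaces inside the text columns.
for col in ["username", "q_type"]:
    clean[col] = clean[col].str.strip()

# 3. Drop a column this analysis does not need.
clean = clean.drop(columns=["FirstName"])

print(clean)
print(clean.isna().sum())  # missing values per column
```

Working on a copy means you can always compare the cleaned table against the original if a later step looks wrong.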


# Step 2. Add group information (merge)

### Workflow status
- ✅ Raw D2L CSV
- ✅ Inspect & clean
- 🔵 **Add group info (merge)**
- ⬜ Split by question type

In this step, we:
- Load a separate group roster
- Merge group information into the main dataset

👉 Goal: *Add meaningful context (groups) to the data.*



In [None]:
# code goes here
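
A minimal sketch of the merge, assuming the roster has one row per student and shares a `Username` column with the quiz table. Both small tables here (`quiz`, `roster`) are invented stand-ins; the real roster file and its column names may differ.

```python
import pandas as pd

# Stand-in quiz rows: one row per student per question.
quiz = pd.DataFrame({
    "Username": ["alex.lee0", "alex.lee0", "riley.brown1", "casey.davis2"],
    "Q #": [1, 2, 1, 1],
    "Score": [1, 0, 1, 0],
})

# Hypothetical roster: one row per student with the study condition.
roster = pd.DataFrame({
    "Username": ["alex.lee0", "riley.brown1"],
    "Group": ["Group A", "Group B"],
})

# how="left" keeps every quiz row, even without a roster match.
# validate="many_to_one" raises an error if a Username repeats in the roster.
# indicator=True adds a _merge column showing where each row came from.
merged = quiz.merge(
    roster, on="Username", how="left",
    validate="many_to_one", indicator=True,
)
print(merged)

# Students with no roster entry are marked "left_only".
unmatched = merged.loc[merged["_merge"] == "left_only", "Username"].unique()
print("No group found for:", list(unmatched))
```

Checking for unmatched students right after the merge catches typos in usernames before they quietly turn into missing group labels.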


# Step 3. Split by question type

### Workflow status
- ✅ Raw D2L CSV
- ✅ Inspect & clean
- ✅ Add group info (merge)
- 🔵 **Split by question type**

In this step, we:
- Separate multiple-choice questions from short-answer questions
- Prepare different analysis paths for different data types

👉 Goal: *Different data types need different analysis strategies.*



In [None]:
# code goes here
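
The split can be done with boolean filtering on the `Q Type` column. The sketch below uses a few made-up rows (`data`) so it runs by itself; with the workshop data you would filter the merged table instead.

```python
import pandas as pd

# Made-up rows: each student has one MC row and one WR row.
data = pd.DataFrame({
    "Username": ["alex.lee0", "alex.lee0", "riley.brown1", "riley.brown1"],
    "Q Type": ["MC", "WR", "MC", "WR"],
    "Answer": ["B", "AI makes feedback faster.",
               "D", "It has both benefits and risks."],
    "Score": [1.0, None, 0.0, None],
})

# Boolean filtering: keep only the rows where the condition is True.
mc = data[data["Q Type"] == "MC"].copy()
wr = data[data["Q Type"] == "WR"].copy()

# WR rows carry no numeric score in this dataset, so that column is dropped there.
wr = wr.drop(columns=["Score"])

print(len(mc), "multiple-choice rows")
print(len(wr), "written-response rows")
```

Calling `.copy()` gives each subset its own table, so later edits to `mc` or `wr` do not trigger pandas warnings about modifying a slice.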


# Step 4A. Multiple-choice questions: group stats & visualization

### Workflow status
- ✅ Raw D2L CSV
- ✅ Inspect & clean
- ✅ Add group info (merge)
- ✅ Split by question type
  - 🔵 **MC → group stats → bar chart**
  - ⬜ Short answer → word freq → word cloud

In this step, we:
- Calculate simple statistics by group
- Create a basic bar chart to compare groups

👉 Goal: *Use simple statistics to answer a teaching or research question.*



In [None]:
# code goes here
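
Here is one possible sketch of the group statistics and bar chart, assuming matplotlib is available (it is in Colab). The scores in `mc` are invented for illustration and are not results from the study.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Stand-in MC rows after the merge (made-up scores).
mc = pd.DataFrame({
    "Username": ["a1", "a1", "a2", "b1", "b1", "c1", "c1", "c2"],
    "Group": ["Group A", "Group A", "Group A", "Group B", "Group B",
              "Group C", "Group C", "Group C"],
    "Score": [1, 1, 0, 1, 0, 0, 0, 1],
})

# For 0/1 scores, the mean is the proportion of correct answers.
stats = (
    mc.groupby("Group")["Score"]
      .agg(mean="mean", std="std", n="count")
      .round(2)
)
print(stats)

# Bar chart of mean score per group.
ax = stats["mean"].plot.bar(rot=0)
ax.set_xlabel("Study condition")
ax.set_ylabel("Proportion correct")
ax.set_ylim(0, 1)
ax.set_title("MC accuracy by group")
plt.tight_layout()
plt.show()
```

Reporting `n` next to the mean is a small habit that helps readers judge how much weight each bar deserves.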


# Step 4B. Short-answer questions: text exploration (optional)

### Workflow status
- ✅ Raw D2L CSV
- ✅ Inspect & clean
- ✅ Add group info (merge)
- ✅ Split by question type
  - ⬜ MC → group stats → bar chart
  - 🔵 **Short answer → word freq → word cloud**

In this step, we:
- Explore common words or phrases in open-ended responses
- Visualize themes using word frequency or a word cloud

👉 Goal: *Get a quick, exploratory sense of what students are saying.*

*(Optional: skip if time is limited.)*
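
A rough sketch of the word-frequency part, using only the Python standard library. The responses are a few of the sample sentences used to build the fake dataset, and the stop-word list is a tiny hand-picked assumption; a real project would normally use a longer, standard list.

```python
import re
from collections import Counter

# A few sample responses from the fake dataset.
responses = [
    "AI tools help me improve grammar and vocabulary.",
    "AI makes feedback faster and more accessible.",
    "I worry students may rely too much on AI.",
    "AI can be useful but needs clear guidelines.",
]

# Tiny hand-picked stop-word list (an assumption, for illustration only).
stop_words = {"i", "me", "and", "on", "too", "much",
              "may", "can", "be", "but", "more"}

words = []
for text in responses:
    # Lowercase the text and keep runs of letters (and apostrophes) as words.
    tokens = re.findall(r"[a-z']+", text.lower())
    words.extend(t for t in tokens if t not in stop_words)

freq = Counter(words)
print(freq.most_common(5))
```

If you want a picture as well, the third-party `wordcloud` package can draw a word cloud from counts like these; that step is left out here so the sketch needs no extra installs.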



# Wrap-up: Adapting this workflow to your own data

We’ve walked through a complete workflow:
- From raw LMS data
- To cleaned, combined, and analyzed results

Think about:
- Which steps are essential for your own project?
- Where might you stop, simplify, or extend the workflow?
- How could AI tools help you modify this code safely?

👉 The goal is not to memorize code,  
but to **read, understand, and adapt workflows**.
