# Day 1: Big Picture, Tools Landscape & Data Handling in Python

Open in Google Colab (if you have not done so yet!): [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/ValRCS/RTU_Data_Analysis_Visualization_CPD/blob/main/notebooks/day1_big_picture_intro.ipynb)

## Course Description

This course provides a practical introduction to **data analysis** and **data visualization** using Python and modern AI-assisted tooling.  
Students will learn to work through the complete analytics workflow — from data collection, cleaning, and preparation, through exploratory data analysis (EDA), modeling, and visualization — with a focus on producing clear, actionable insights.  

Core skills are built using Python’s rich data ecosystem, including **pandas**, **matplotlib**, **seaborn**, **plotly**, and **scikit-learn**.  
In addition, we explore the use of **AI tools** such as ChatGPT, GitHub Copilot, and agent frameworks to enhance productivity, automate repetitive tasks, and assist in coding, exploration, and communication.  

By the end of the course, students should have a basic idea of how to analyze datasets, create compelling visualizations, build simple dashboards, and integrate AI assistance into their data workflows.

<div style="display: flex; align-items: center; justify-content: center; gap: 30px;">
  <img src="https://github.com/ValRCS/RTU_Data_Analysis_Visualization_CPD/blob/main/img/RTU_Analysis_Puzzled.png?raw=true" width="400"/>

  <!-- Thick arrow as inline SVG -->
  <span style="font-size:60px; line-height:20px; color:rgb(0,88,84);">⇒</span>

  <img src="https://github.com/ValRCS/RTU_Data_Analysis_Visualization_CPD/blob/main/img/RTU_1862_wall.png?raw=true" width="400"/>
</div>

## About the Instructor

Valdis Saulespurēns is a lecturer at Riga Technical University, where he teaches Python, JavaScript, algorithms, AI, and other computer science subjects. He also works as a researcher and developer at the National Library of Latvia. Valdis specializes in Machine Learning and Data Analysis, and he enjoys **transforming disordered data into structured knowledge**. With more than 30 years of programming experience, Valdis began his professional career by writing programs for quantum scientists at the University of California, Santa Barbara. Before moving into teaching, he developed software for a radio broadcast equipment manufacturer. Valdis holds a Master's degree in Computer Science from the University of Latvia.

 E-mail: valdis.saulespurens@rtu.lv

## Prerequisites

- 📄 Reasonable knowledge of written English: most materials are in English, and Latvian materials are sparse
- 💻 Little or no programming experience  
- 📊 Basic familiarity with spreadsheets such as Excel  
- 📈 Interest in data analysis and visualization  
- ⏳ Ability to set aside a few hours every week for practice and exercises  

### Computer Requirements

- 🌐 Internet connection with any up-to-date web browser (e.g., Google Chrome, Mozilla Firefox, Microsoft Edge, Safari)  
- 📧 A Gmail account for accessing Google Colab and related tools  

**Optional but Recommended (for local development - cloud access can be run on a potato):**  
- 🖥 Operating System: Windows 10/11, macOS 12 or later, or a modern Linux distribution  
- ⚙️ Processor: Dual-core CPU (Intel i5/Ryzen 3 or better)  
- 🛠 Memory: 8 GB RAM or more (16 GB preferred)  
- 💾 Storage: At least 20 GB of free SSD space for software, data files, and projects  
- 🖥 Screen: Minimum 13" display with 1920×1080 resolution or higher  
- 🛠 Software: Ability to install and run Visual Studio Code, Python, and Jupyter Notebooks locally  

## Limitations

- ⏳ Limited course time: **5 days × 5 academic hours** means we focus on core concepts rather than exhaustive coverage  
- 🎣 This course is more about **teaching you to fish** — and even **teaching you how to find more fishing instructions** — than providing a complete manual  
- 🐍 Concentrates on **Python with pandas** through **Jupyter Notebooks on Google Colab**, rather than traditional `.py` file programming on a local machine  
- 🤖 Focuses on a **small subset of AI tooling** (e.g., ChatGPT, GitHub Copilot) instead of covering the full range of options such as Gemini or Claude Code  
- 🧭 The goal is to provide enough **"hooks"** so students can confidently continue learning and know **what to research next** after the course  

## Course Structure

Every day will consist of:
- 🗣 **Lecture**: 3 academic hours of theory and practice with short breaks
- 💻 **Hands-on Practice**: 1 academic hour of exercises and projects
- 🔄 **Reflection**: 1 academic hour of Q&A, discussion, feedback, and quizzes

## Other Course Resources

There is also a course page on the RTU DAS platform: [Python Data Analysis and Visualization](https://www.das.lv/platforma/course/view.php?id=34). Access requires that the course administrator has added you to the course. We will use this platform for quizzes and feedback.

## Key terms - the map of the territory in the course

### General Data and Computing Terms

* Data Analysis - The process of inspecting, cleaning, transforming, and modeling data to discover useful information, inform conclusions, and support decision-making.
* Data Visualization - The graphical representation of information and data, using visual elements like charts, graphs, and maps to make complex data more accessible, understandable, and usable.
* Data Science - An interdisciplinary field that uses scientific methods, processes, algorithms, and systems to extract knowledge and insights from structured and unstructured data.
* Big Data - Extremely large datasets that may be analyzed computationally to reveal patterns, trends, and associations, especially relating to human behavior and interactions.
*Informally - Big Data - what does not fit into local computer memory, or is too large to be processed by a single machine in a reasonable time.*
* Cloud Computing - The delivery of computing services over the internet ("the cloud"), allowing for on-demand access to a shared pool of configurable computing resources (e.g., servers, storage, databases, networking, software).
*Informally - Cloud Computing - computing on someone else's computer, which you access via the internet.*

### AI and Machine Learning Terms
* Machine Learning - A subset of artificial intelligence (AI) that involves the use of algorithms and statistical models to enable computers to perform tasks without explicit instructions, relying on patterns and inference instead.
* LLM - Large Language Model, a type of AI model trained on vast amounts of text data to understand and generate human-like language, enabling applications like chatbots, translation, and content generation. LLMs are a type of Machine Learning model.
* GitHub Copilot - An AI-powered code completion tool developed by GitHub, which uses machine learning to suggest code snippets and entire functions as developers write code in various programming languages.
* ChatGPT - A conversational AI assistant developed by OpenAI, built on its GPT family of models (one of many LLM families), designed to understand and generate human-like text responses in a conversational context.


### Programming and Data Handling Terms
* Programming - The process of designing and building executable computer software to accomplish a specific task or solve a problem, typically involving writing code in a programming language.
* Python - A high-level, interpreted programming language known for its readability and versatility, widely used in data analysis, web development, automation, and more.
* Pandas - A powerful open-source data manipulation and analysis library for Python, providing data structures such as `Series` and `DataFrame` for working with tabular data.
* Jupyter Notebook - An open-source web application that allows you to create and share documents containing live code, equations, visualizations, and narrative text, widely used for data analysis and scientific computing.
* Google Colab - A cloud-based Jupyter notebook environment provided by Google, free to use with optional paid plans, that allows users to write and execute Python code in the browser, with access to powerful computing resources.

### Any data terms you would like to add?




## 📅 Day 1 Plan: Big Picture, Tools Landscape & Data Handling in Python

**📚 Instruction (3h)**  
- 🌐 Big picture: data analysis workflow  
- ⚖️ Tools comparison: Python vs Excel, Power BI, Tableau (how they complement each other)  
- 📓 Jupyter Notebooks for interactive analysis & documentation  
- 🐍 Python basics for data work  
- 📊 Intro to `pandas` DataFrames  
- 📥 Data import (CSV, Excel, JSON) & inspection  

**🛠 Practical (1h)**  
- 📂 Load & filter a dataset  
- ➗ Compute basic statistics  

**🔄 Reflection (1h)**  
- 💬 Group discussion: comparing workflows in Python vs spreadsheets  
- 📝 Mini quiz on `pandas` basics  
- 🐞 Troubleshoot common import/inspection issues
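
As a small preview of the import, inspection, filtering, and statistics steps listed in the plan, here is a minimal sketch. The data is made up for illustration, and it is read from a string with `io.StringIO` so the cell runs without any file; in practice `pd.read_csv` would usually receive a file path or URL.

```python
import io

import pandas as pd

# Made-up sample data for illustration only.
csv_text = """product,category,price,quantity
apple,fruit,0.50,10
banana,fruit,0.25,12
carrot,vegetable,0.10,30
bread,bakery,1.20,2
"""

# StringIO lets read_csv treat a string like a file.
df = pd.read_csv(io.StringIO(csv_text))

# Inspect: first rows, column types, and (rows, columns).
print(df.head())
print(df.dtypes)
print(df.shape)

# Filter: keep only the rows where category is "fruit".
fruit = df[df["category"] == "fruit"]
print(fruit)

# Basic statistics on numeric columns.
mean_price = df["price"].mean()
total_quantity = df["quantity"].sum()
print(f"Mean price: {mean_price:.2f}, total quantity: {total_quantity}")
```

The same pattern (load, inspect, filter, summarize) applies to Excel and JSON sources through `pd.read_excel` and `pd.read_json`.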

## 🧭 Data Analysis & Visualization Workflow

### 1. 🎯 Defining the Problem
- ✅ Clarify goals – What are you trying to understand, explain, or predict?  
- 👥 Identify stakeholders – Who will use the results? What decisions will it support?  
- ❓ Formulate measurable questions – Turn vague objectives into specific, testable ones  
  - *Example:* Instead of “Improve sales,” ask: “What factors most influence monthly sales volume?”  

### 2. 📥 Data Acquisition
- 📂 Collect from:
  - 🏢 Internal sources (databases, transaction logs, CRM systems)  
  - 🌍 External sources (APIs, surveys, government datasets, sensors)  
- 🔑 Check accessibility – permissions, licenses  
- 📝 Document sources for reproducibility  

### 3. 🧹 Data Cleaning & Preparation (often cited as 50–80% of the work)
- 🔧 Handle missing values – fill in, remove, or flag  
- 🗑 Remove duplicates & errors  
- 📅 Standardize formats – dates, units, currency, text casing  
- 🏗 Feature engineering – derive new variables from existing data  
- 🔗 Combine datasets – merge or join multiple sources  
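
The cleaning steps above could look roughly like this in pandas. This is a sketch on made-up data, and the column names (`order_id`, `unit_price`, `segment`, and so on) are hypothetical. Filling missing prices with the median is just one possible choice; removing or only flagging the rows can be equally valid depending on the question.

```python
import pandas as pd

# Made-up raw data with typical problems: a duplicate row, a missing value,
# inconsistent text casing, and dates stored as text.
orders = pd.DataFrame({
    "order_id": [1, 2, 2, 3, 4],
    "customer_id": [10, 11, 11, 10, 12],
    "city": ["Riga", "riga ", "riga ", "LIEPAJA", "Riga"],
    "order_date": ["2025-01-05", "2025-01-06", "2025-01-06", "2025-01-07", "2025-01-08"],
    "unit_price": [2.0, 3.5, 3.5, None, 1.0],
    "quantity": [3, 1, 1, 2, 5],
})
customers = pd.DataFrame({
    "customer_id": [10, 11, 12],
    "segment": ["retail", "wholesale", "retail"],
})

# Remove exact duplicate rows.
clean = orders.drop_duplicates().copy()

# Handle missing values: flag them first, then fill in with the median.
clean["price_missing"] = clean["unit_price"].isna()
clean["unit_price"] = clean["unit_price"].fillna(clean["unit_price"].median())

# Standardize formats: trim and re-case text, parse dates.
clean["city"] = clean["city"].str.strip().str.title()
clean["order_date"] = pd.to_datetime(clean["order_date"])

# Feature engineering: derive a new column from existing ones.
clean["total"] = clean["unit_price"] * clean["quantity"]

# Combine datasets: add customer information to each order.
clean = clean.merge(customers, on="customer_id", how="left")
print(clean)
```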

### 4. 🔍 Exploratory Data Analysis (EDA)
- 📊 Summarize with descriptive statistics – mean, median, variance, distributions  
- 👀 Spot patterns and trends  
- 🚨 Detect anomalies or outliers  
- 💡 Test initial hypotheses  
- 🖼 Create simple visualizations – histograms, scatter plots, boxplots  
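
A first EDA pass can be as short as the sketch below. The numbers are made up, and the 1.5 × IQR rule used here is only one common rule of thumb for flagging possible outliers, not a definitive test.

```python
import pandas as pd

# Made-up daily sales figures; one value is deliberately unusual.
sales = pd.Series([20, 22, 19, 21, 23, 20, 95, 22, 21, 18], name="daily_sales")

# Descriptive statistics: count, mean, std, min, quartiles, max.
print(sales.describe())
print("Median:", sales.median())

# Rule of thumb: flag values more than 1.5 * IQR outside the quartiles.
q1 = sales.quantile(0.25)
q3 = sales.quantile(0.75)
iqr = q3 - q1
outliers = sales[(sales < q1 - 1.5 * iqr) | (sales > q3 + 1.5 * iqr)]
print("Possible outliers:", outliers.tolist())
```

Note how the single unusual value pulls the mean well above the median, which is why it is worth looking at both.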

### 5. 📐 Modeling & Analysis
- 🧮 Choose an approach:
  - 📊 Statistical tests (t-tests, ANOVA, correlation)  
  - 🤖 Predictive modeling (linear regression, decision trees, neural networks)  
  - 🧩 Clustering & segmentation  
- 📏 Evaluate performance – accuracy, error rates, R², precision/recall, etc.  
- 🧐 Validate assumptions – ensure the method fits the data  

### 6. 📈 Data Visualization
- 📊 Choose the right chart for the question:
  - 📊 Comparisons → bar charts  
  - 📉 Distributions → histograms, boxplots  
  - 📆 Trends over time → line charts  
  - 🔗 Relationships → scatter plots  
  - 🥧 Parts of a whole → pie/donut charts (sparingly)  
- 🎨 Apply design principles:
  - 🧩 Keep it simple (avoid “chart junk”)  
  - 🎨 Use consistent scales and colors  
  - 🏷 Label clearly  
- 🖥 Consider interactive tools – dashboards, filters, drill-downs  
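
Two of the chart choices above, sketched with matplotlib on made-up values: a bar chart for a comparison and a line chart for a trend over time. The titles, labels, and figure size are arbitrary choices for this example.

```python
import matplotlib.pyplot as plt

# Made-up values for illustration only.
categories = ["A", "B", "C"]
totals = [30, 45, 25]
months = [1, 2, 3, 4, 5, 6]
revenue = [10, 12, 11, 15, 17, 16]

# One figure with two side-by-side charts.
fig, (ax_bar, ax_line) = plt.subplots(1, 2, figsize=(9, 3.5))

# Comparison between categories -> bar chart.
ax_bar.bar(categories, totals, color="tab:blue")
ax_bar.set_title("Total by category")
ax_bar.set_xlabel("Category")
ax_bar.set_ylabel("Total (units)")

# Trend over time -> line chart.
ax_line.plot(months, revenue, marker="o", color="tab:blue")
ax_line.set_title("Revenue by month")
ax_line.set_xlabel("Month")
ax_line.set_ylabel("Revenue (units)")

# Avoid overlapping labels, then display the figure.
fig.tight_layout()
plt.show()
```

Both charts use the same color and carry a title and axis labels, which follows the "consistent colors" and "label clearly" principles.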

### 7. 📖 Interpretation & Storytelling
- 🤔 Explain the *why* behind patterns  
- 🔗 Link results back to the original question  
- 🎯 Highlight implications for decision-making  
- ⚠️ Acknowledge limitations and uncertainty  

### 8. 📢 Communication & Delivery
- 📑 Reports – PDF, Word, LaTeX  
- 📊 Dashboards – Tableau, Power BI, Plotly Dash  
- 🎤 Live presentations – slides, interactive demos  
- 🧑‍🤝‍🧑 Tailor depth and format to the audience (executives, technical teams, general public)  

### 9. 🚀 Action & Monitoring
- 🛠 Implement recommendations  
- 📊 Track metrics over time  
- 🔄 Iterate as new data comes in  
- 📈 Adjust strategies based on feedback and results

## 🛠 Tools Landscape

| Tool     | Strengths | Limitations | Best For |
|----------|-----------|-------------|----------|
| **Python** (Open-source, code-based) | • Automation & reproducibility  <br>• Handles large datasets  <br>• Flexible analysis  <br>• Integration with APIs/databases  <br>• Huge library ecosystem | • Steeper learning curve  <br>• Requires coding knowledge | Complex, custom, or large-scale analyses; automated pipelines; integrating multiple data sources |
| **Excel** (Spreadsheet-based) | • Widely used in business  <br>• Easy to learn  <br>• Quick calculations & charts  <br>• Strong for small datasets | • Limited scalability  <br>• Weak automation  <br>• Prone to manual errors | Quick ad-hoc analysis and reporting; small datasets; business users without coding skills |
| **Power BI** (Microsoft BI platform) | • Powerful dashboards  <br>• Microsoft ecosystem integration  <br>• Good data modeling tools  <br>• Easy sharing | • Less flexible for custom analysis  <br>• License cost for sharing and collaboration (Power BI Desktop itself is free) | Interactive business dashboards; automated refresh from Microsoft data sources; corporate reporting |
| **Tableau** (Visualization & storytelling) | • Best-in-class visualizations  <br>• Strong storytelling tools  <br>• Drag-and-drop interface  <br>• Good interactivity | • Limited data prep capabilities  <br>• License cost | Creating polished, interactive visuals and stories for presentations; visual exploration of data |


## 📓 Jupyter Notebooks — The Very Basics

Jupyter Notebooks are an interactive environment that combines **code**, **text**, and **results** in one place.  
They are especially useful for **data analysis**, **experiments**, and **teaching**.  

### 🔑 Key Features
- **Cells**: The notebook is divided into cells that can contain either:
  - **Code cells** (Python or other languages)
  - **Markdown cells** (formatted text, math, images, links)
- **Interactive execution**: Run code cell-by-cell and see results immediately.
- **Documentation and narrative**: Mix explanations with code to tell the story of your analysis.

### 🛠 Basic Actions
- ▶️ **Run a cell**: `Shift + Enter`  
- ➕ **Add a new cell**: `+` button or `Insert` menu  
- ✏️ **Switch cell type**: Code ↔ Markdown (use toolbar or menu)  
- 🔄 **Restart kernel**: Clears memory and restarts execution environment  

### ✅ Why Use Jupyter?
- Great for **step-by-step analysis**.  
- Supports **visualizations** inline.  
- Makes it easy to **share reproducible research**.  


## 🌐 Google Colab — Jupyter in the Cloud

[Google Colab](https://colab.research.google.com/) is a free, cloud-based platform that runs **Jupyter Notebooks** without any local installation.  
It’s widely used for **teaching**, **data analysis**, and **machine learning**.

### 🔑 Key Features
- **No installation required**: Works entirely in your web browser.  
- **Free compute resources**: Provides CPU runtimes, plus GPU and sometimes TPU acceleration, subject to availability and usage limits.  
- **Google Drive integration**: Save and load notebooks directly from your Google Drive.  
- **Collaboration**: Share notebooks just like Google Docs for real-time teamwork.  
- **Preinstalled libraries**: Many data science and AI libraries (e.g., pandas, NumPy, TensorFlow, PyTorch) are already included.  
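
If you want to check which libraries are available in your current environment (Colab or local), a small sketch like the following works with the standard library only. The list of names is just the set of libraries mentioned in this course; note that scikit-learn is imported under the name `sklearn`.

```python
import importlib.util
import sys

# Import names of libraries mentioned in this course; adjust as needed.
libraries = ["pandas", "numpy", "matplotlib", "seaborn", "plotly", "sklearn"]

print("Python version:", sys.version.split()[0])

availability = {}
for name in libraries:
    # find_spec returns None when the package cannot be found.
    availability[name] = importlib.util.find_spec(name) is not None
    status = "available" if availability[name] else "NOT installed"
    print(f"{name:<12} {status}")
```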

### 🛠 How It Relates to Jupyter
- **Same interface & workflow**: Colab is essentially Jupyter Notebook hosted in the cloud.  
- **Compatible format**: You can open `.ipynb` files from local Jupyter in Colab and vice versa.  
- **Extra conveniences**: GPU/TPU hardware, easy sharing, integration with Google ecosystem.  
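
The reason `.ipynb` files move freely between local Jupyter and Colab is that a notebook is a plain JSON text file. The sketch below builds a hand-written, simplified notebook structure (illustrative only; real notebooks carry more metadata) and reads it back with the standard `json` module.

```python
import json

# A hand-written, minimal notebook structure (illustrative only).
notebook = {
    "nbformat": 4,
    "nbformat_minor": 4,
    "metadata": {},
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# My analysis"]},
        {
            "cell_type": "code",
            "metadata": {},
            "execution_count": None,
            "outputs": [],
            "source": ["print('Hello')"],
        },
    ],
}

# Saving a notebook amounts to writing this structure as JSON text.
notebook_text = json.dumps(notebook, indent=1)

# Any tool that opens the file parses the JSON back into cells.
loaded = json.loads(notebook_text)
cell_types = [cell["cell_type"] for cell in loaded["cells"]]
print(cell_types)
```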

### ⚖️ Trade-offs
- **Pros**: No setup, free compute, easy collaboration.  
- **Cons**: Requires internet connection, limited runtime/session length, dependency on Google’s environment.  


### Local vs Cloud

When choosing between local Jupyter Notebooks and Google Colab, consider:
- **Local Jupyter**:
  - Pros: Full control over environment, no internet dependency, can handle larger datasets without session limits.
  - Cons: Requires setup, may need local resources (RAM, CPU), less convenient for collaboration. 

- **Google Colab**:
  - Pros: No setup, free compute, easy collaboration, preinstalled libraries.   
  - Cons: Internet dependency, session limits, less control over environment.


I will provide instructions for setting up local Jupyter Notebooks for those interested after a few days, once we have covered the basics of data handling in Python.

## ✍️ Markdown — The Basics

Markdown is a **lightweight markup language** used to format plain text.  
In Jupyter Notebooks, Markdown cells let you add **headings, lists, links, images, equations, and more** to explain your code and results.

### 🔑 Common Elements
- **Headings**  
  `# Heading 1`  
  `## Heading 2`  
  `### Heading 3`  

- **Text formatting**  
  *Italic* → `*Italic*`  
  **Bold** → `**Bold**`  
  `Code` → `` `Code` ``  

- **Lists**  
  - Bulleted list → `- item` or `* item`  
  1. Numbered list → `1. item`  

- **Links & Images**  
  [Example link](https://example.com) → `[Example link](https://example.com)`  
  ![Alt text](https://example.com/image.png) → `![Alt text](url)`  

There are also more advanced features like:
- **Tables**
- **Math (LaTeX syntax)**  
- **Footnotes**
- **Blockquotes**  
- **Other formatting options**


### 📚 Learn More
👉 [Markdown Guide on GitHub](https://guides.github.com/features/mastering-markdown/)



---

📦 **Mini Task: Markdown Practice**

Create a **bulleted list** of 3 of your favorite websites.  
Each list item should be a **clickable link**.

👉 Example format (replace with your own sites):

- [Riga Technical University](https://www.rtu.lv)  
- [Python Official](https://www.python.org)  
- [Pandas Documentation](https://pandas.pydata.org/docs/)

---


## 💻 Code Cells in Jupyter

Code cells let you **write and run code directly** inside the notebook.  
In this course we’ll use **Python**, but Jupyter can support many languages.

### 🔑 Key Points
- Code cells can contain Python code (or another supported language).  
- Press **Shift + Enter** (or the ▶️ button) to run a cell.  
- The output appears **directly below the cell**.  
- Variables and functions you define stay in memory until the kernel is restarted.  

### 🐍 Example
```python
# A simple Python example
name = "RTU"
year = 2025
message = f"Hello {name}! Welcome to {year}."
print(message)
```

Note: the code above will not run because it sits in a Markdown cell.
To run code, we need to create a code cell (see below).

In Google Colab, you can create a new code cell by clicking the `+ Code` button or using the keyboard shortcut `Ctrl + M B` (or `Cmd + M B` on Mac).  
Then you can paste the code above into the new cell and run it.

In [1]:

```python
# A simple Python example
# A hash (#) starts a comment; Python ignores the rest of the line
# Use comments to explain your code: WHY you did something, not just WHAT you did
name = "Valdis"  # assign the text "Valdis" to the variable name
# Python is case-sensitive, so 'name' and 'Name' are different variables
# In Python, '=' assigns a value to a variable
# Python does not require declaring variable types, unlike statically typed languages such as Java or C++

year = 2025  # assign the number 2025 to the variable year
# Variables are like boxes that hold data
message = f"Hello {name}! Welcome to {year}."  # an f-string inserts variable values into a string
# An f-string starts with 'f' and uses curly braces {} to insert variables
print(message)  # print() displays the message below the cell
```

Hello Valdis! Welcome to 2025.
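
The curly braces in an f-string can hold more than a plain variable name. As a short sketch with made-up values, they also accept expressions, method calls, and an optional format specification after a colon:

```python
unit_price = 2.5
quantity = 4
course = "data analysis"

# Expressions are evaluated inside the braces.
total_line = f"Total: {unit_price * quantity}"

# ':.2f' formats a number with two digits after the decimal point.
price_line = f"Unit price: {unit_price:.2f}"

# Method calls work too.
title_line = f"Course: {course.title()}"

print(total_line)
print(price_line)
print(title_line)
```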


---

📦 **Mini Task: Your First Variable**

1. Create a variable with your **first name** as a string.  
2. Print a greeting that includes your name using an **f-string**.  

👉 Example (replace with your own name):

```python
my_name = "Anna"
print(f"Hello, my name is {my_name}")
