<a href="https://colab.research.google.com/github/XTMay/python-data-science-course/blob/main/notebooks/Lec_1_Python_Basics_for_Data_Science.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Lesson 01: Python Basics for Data Science

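**Goal**: Learn Python's basic syntax, variables, data structures, and control flow, laying the groundwork for the data processing and analysis in later lessons.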




## 📚 Course Outline


### 1️⃣ Introduction to Python
#### 💻 What is Python?

[Welcome to Python](https://www.python.org/)

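Python is a high-level, general-purpose, interpreted programming language, first released by Guido van Rossum in 1991. It is known for concise syntax, readability, and fast development, which makes it a good first language for beginners while still being capable of supporting large systems.

`✅ In one sentence: Python reads almost like English, bringing code closer to the way people think.`

[Official Python tutorial](https://docs.python.org/3.15/tutorial/index.html)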

#### 📊 Why is Python so widely used in data science?

Python has libraries with large collections of mathematical functions and analytical tools:

*   Pandas - Used for structured data operations such as importing CSV files, creating dataframes, and preparing data.
*   NumPy - A mathematical library with a powerful N-dimensional array object, linear algebra, Fourier transforms, and more.
*   Matplotlib - Used for data visualization.
*   SciPy - Builds on NumPy with modules for scientific computing, including linear algebra.

We will use these libraries throughout the tutorial to create examples.

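| Advantage | Details |
|------|------|
| 📚 Rich scientific computing libraries | NumPy (numerical computing), pandas (data processing), Matplotlib/Seaborn (visualization), scikit-learn (machine learning), and others make data work efficient and approachable. |
| 🧠 Strong AI / ML support | Deep learning and modeling frameworks such as TensorFlow, PyTorch, XGBoost, and LightGBM are available; almost every mainstream AI tool offers a Python interface. |
| 🛠️ Simple to learn, readable code | Compared with R or Java, Python is generally easier to learn and maintain, which suits data analysts and researchers without a programming background. |
| 🌍 Large community and ecosystem | From beginner tutorials and open-source projects to Q&A sites such as StackOverflow, the Python data science community is very active, so solutions to common problems are easy to find. |
| 📈 Cross-platform and extensible | Python runs on Windows / Mac / Linux and can be integrated with languages such as C/C++/Java, which makes it very flexible. |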

In [None]:
print("Hello, Data Science! 👋")

Hello, Data Science! 👋


In [None]:
print("Hello World!") # function: print; argument: "Hello World!"

Hello World!


### 2️⃣ Variables and Basic Data Types

- Common types: int, float, str, bool
- Type checking and conversion: type(), int(), float(), str()
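As a small illustration of the checking and conversion functions listed above, here is a minimal sketch. The variable names (`age_text`, `age_number`, and so on) are made up for this example and are not used elsewhere in the lesson.

```python
# Check a value's type with type(), then convert between the basic types.
age_text = "28"                    # a str, for example something a user typed
age_number = int(age_text)         # str -> int
height_number = float("1.65")      # str -> float
age_label = str(age_number)        # int -> str

print(type(age_text), type(age_number), type(height_number), type(age_label))

truncated = int(3.99)              # int() drops the fractional part; it does not round
print(truncated)                   # 3

# Empty or zero values convert to False; any non-empty string converts to True.
print(bool(0), bool(""), bool("False"))
```

Note that `bool("False")` is `True` because the string is not empty, which is a common surprise when converting text input.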

In [None]:
name = "Alice"
age = 28
height = 1.65
is_data_scientist = True

In [None]:
print(name, age, height, is_data_scientist)

Alice 28 1.65 True


In [None]:
print(type(name))

<class 'str'>


In [None]:
print(type(3.14))

<class 'float'>


- Booleans behave like numbers: True converts to 1 and False to 0

In [None]:
is_data_scientist = False

In [None]:
float(is_data_scientist)

0.0
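Because `True` counts as 1 and `False` as 0, a list of booleans can be summed to count how many are `True`. The sketch below uses a hypothetical list of survey answers invented for illustration.

```python
# Hypothetical survey answers: did each respondent use Python this week?
answers = [True, False, True, True]

yes_count = sum(answers)               # True adds 1, False adds 0
yes_share = yes_count / len(answers)   # fraction of True answers

print(yes_count, yes_share)            # 3 0.75
```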

### User Input (Data Collection)

In [None]:
user_name = input("Please enter your name: ")
print("Welcome to the world of data science,", user_name)

Please enter your name: ma
Welcome to the world of data science, ma


### 3️⃣ Common Data Structures

#### 🔹 List: the basic container for a dataset

In [None]:
cities = ["Tokyo", "New York", "London"]

In [None]:
cities.append("Shanghai") # append() adds an item to the end of the list

In [None]:
print(cities)

['Tokyo', 'New York', 'London', 'Shanghai']


In [None]:
cities[2]

'London'

In [None]:
cities

['Tokyo', 'New York', 'London', 'Shanghai']

In [None]:
len(cities)

4

In [None]:
len('Tokyo')

5

In [None]:
lst_test = [123, "123", 123.45, True]

In [None]:
lst_test[-1]

True

In [None]:
lst_test[-2]

123.45
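Beyond single indexes such as `[2]` or `[-1]`, lists also support slicing with `start:stop:step`. The following sketch uses a separate list, `cities_demo` (a name chosen for this example), so it does not change the `cities` list above.

```python
cities_demo = ["Tokyo", "New York", "London", "Shanghai"]

first_two = cities_demo[:2]      # items at index 0 and 1 (the stop index is excluded)
last_two = cities_demo[-2:]      # the last two items
every_other = cities_demo[::2]   # every second item, starting at index 0

print(first_two)     # ['Tokyo', 'New York']
print(last_two)      # ['London', 'Shanghai']
print(every_other)   # ['Tokyo', 'London']
```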

#### 🔹 Dict: a key-value structure, similar to those in databases

In [None]:
person = {"name": "Alice", "age": 28} # key: value(any type)

In [None]:
person

{'name': 'Alice', 'age': 28}

In [None]:
print(person["name"])

Alice


In [None]:
person["name"] = 'May'

In [None]:
person

{'name': 'May', 'age': 28}

In [None]:
dic_test = {1: '1', '2': 2}

In [None]:
dic_test[1]

'1'

In [None]:
dic_test = {3.14: '1', 4.23: 2}

In [None]:
dic_test[3.14]

'1'

In [None]:
dic_test = {True: 1, False: 0}

In [None]:
dic_test[True]

1

In [None]:
dic_test = {"list": [1, 5], "dict": {"key": "value", "key2": "value2"}} # values can be any type, including lists and dicts

In [None]:
dic_test_2 = {[1, 2, 3]: "test"} # keys must be hashable (e.g. int, float, str, bool, tuple); values can be any type

TypeError: unhashable type: 'list'

- Keys are unique: assigning to an existing key replaces its value instead of adding a second entry
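The two rules about keys (they must be hashable, and they are unique) can be checked with a short sketch. The dictionaries here (`locations`, `duplicate`) are invented for illustration.

```python
# A tuple is hashable, so it can serve as a dictionary key.
locations = {(35.6, 139.6): "Tokyo"}
print(locations[(35.6, 139.6)])   # Tokyo

# Writing the same key twice keeps only the last value.
duplicate = {"a": 1, "a": 2}
print(duplicate)                  # {'a': 2}

# A list is mutable and therefore not hashable, so using it as a key fails.
try:
    bad = {[1, 2]: "x"}
except TypeError as err:
    message = str(err)
    print("TypeError:", message)
```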

In [None]:
person['height'] = 164 # add a new item (or update it if the key already exists)

In [None]:
person['age'] = 18 # update the value of an existing key

In [None]:
person

{'name': 'May', 'age': 18, 'height': 164}

#### 🔹 Tuple & Set

In [None]:
coordinates = (35.6, 139.6) # a tuple cannot be modified (it is immutable)

In [None]:
coordinates

(35.6, 139.6)

In [None]:
coordinates[1]

139.6

In [None]:
coordinates[1] = 123 # 'tuple' object does not support item assignment

TypeError: 'tuple' object does not support item assignment

In [None]:
unique_items = set([1, 2, 2, 3])  # a set removes duplicates automatically

In [None]:
unique_items

{1, 2, 3}
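Besides removing duplicates, sets support membership tests and operations such as intersection and union. Here is a brief sketch with two hypothetical skill sets (`team_a` and `team_b` are made-up names).

```python
team_a = {"Python", "SQL", "R"}
team_b = {"Python", "Excel"}

shared = team_a & team_b      # intersection: items in both sets
combined = team_a | team_b    # union: items in either set
only_a = team_a - team_b      # difference: items only in team_a
has_sql = "SQL" in team_a     # membership test

print(shared)     # {'Python'}
print(has_sql)    # True
```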

### 4️⃣ Control Flow

#### 🔸 Conditionals: if / elif / else

In [None]:
score = 85

In [None]:
if score >= 90:
    print("Excellent")
elif score >= 60:
    print("Pass")
else:
    print("Fail")

Pass


- condition 1 `and` condition 2: True only when both conditions are True
- condition 1 `or` condition 2: True when at least one condition is True
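A quick sketch of how the two operators combine comparisons. The variable `score_demo` is introduced only for this example; Python also allows a chained comparison as a shorter way to write a range check.

```python
score_demo = 85

in_range = score_demo >= 60 and score_demo < 90   # both parts must be True
chained = 60 <= score_demo < 90                   # same test as a chained comparison
out_of_bounds = score_demo < 0 or score_demo > 100  # True if either part is True

print(in_range, chained, out_of_bounds)   # True True False
```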

In [None]:
True or False

True

In [None]:
height = 170

In [None]:
if height > 170:
  print("high")
elif height >= 160 and height <= 170: # both conditions must hold
  print("medium")
else:
  print("low")

medium


#### 🔸 Loops: for / while

In [None]:
lst = [1, 2, 3]

In [None]:
for item in lst:
  print(item)

1
2
3


In [None]:
for i in range(1, 15, 5):
    print("Data point", i)

Data point 1
Data point 6
Data point 11


In [None]:
n = 0

In [None]:
while n < 6:
    print("Loop count:", n)
    if n == 4:
      n += 1
      continue

    n += 1
    print('n+1')


Loop count: 0
n+1
Loop count: 1
n+1
Loop count: 2
n+1
Loop count: 3
n+1
Loop count: 4
Loop count: 5
n+1


In [None]:
# continue: skip the rest of the current iteration and move on to the next one; break: exit the loop entirely
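To see `continue` and `break` side by side, here is a minimal sketch over a hypothetical list of readings (the data and names are invented for this example): negative values are skipped, and the loop stops at the first zero.

```python
readings = [3, -1, 7, 0, 9, 4]   # hypothetical data
kept = []

for value in readings:
    if value < 0:
        continue      # skip this value and go to the next one
    if value == 0:
        break         # leave the loop; 9 and 4 are never visited
    kept.append(value)

print(kept)   # [3, 7]
```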

In [None]:
# For each number from 1 to 20 (inclusive, [1, 20]): if it is even (2, 4, 6, 8, 10, ...), compute its square (e.g. 2 * 2) and print it

In [None]:
for i in range(1, 21):
  if i % 2 == 0:
    print(i, i ** 2)

2 4
4 16
6 36
8 64
10 100
12 144
14 196
16 256
18 324
20 400


In [None]:
# even numbers: n % 2 == 0
i = 2

In [None]:
for i in range(1, 21): # i runs from 1 to 20
  # even number: print the square
  if i % 2 == 0:
    print(i * i)
  # odd number: print the cube
  elif i % 2 == 1:
    print(i * i * i)

### 5️⃣ In-Class Exercises

#### ✏️ Exercise 1: Variables

In [None]:
name = input("Your name: ")
city = input("The data city you most want to visit: ")
print(f"{name} wants to become a data scientist working in {city}!")

#### ✏️ Exercise 2: BMI Calculator

In [None]:
def calculate_bmi(weight, height): # function: weight in kg, height in m
    return weight / (height ** 2)

bmi = calculate_bmi(60, 1.65)
print("Your BMI is:", bmi)
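As an optional extension that combines the function with `if / elif / else`, the sketch below labels a BMI value. The helper `bmi_category` is hypothetical, and the cut-offs 18.5, 25, and 30 are commonly cited adult thresholds that are treated here as an assumption, not as part of the original exercise.

```python
def calculate_bmi(weight, height):
    """Return BMI for weight in kg and height in m."""
    return weight / (height ** 2)

def bmi_category(bmi_value):
    """Hypothetical helper: label a BMI using assumed cut-offs 18.5 / 25 / 30."""
    if bmi_value < 18.5:
        return "underweight"
    elif bmi_value < 25:
        return "normal"
    elif bmi_value < 30:
        return "overweight"
    else:
        return "obese"

bmi_value = calculate_bmi(60, 1.65)
category = bmi_category(bmi_value)
print(round(bmi_value, 1), category)   # 22.0 normal
```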

In [None]:
import pandas as pd

df_movie = pd.read_csv('/content/drive/MyDrive/AI_Lecture/AI_Data_Scientist/dataset/Netflix_Dataset_Movie.csv')
print(df_movie.head())

In [None]:
# convert df_movie['Year'] to list
lst_year = df_movie['Year'].tolist()
lst_year

In [None]:
lst_movie = list(df_movie['Year'])
lst_movie

In [None]:
type(lst_movie)

In [None]:
# build a dictionary that maps each df_movie['Name'] to its df_movie['Year']
movie_dict = df_movie.set_index('Name')['Year'].to_dict()
movie_dict
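The same name-to-year mapping can be built without pandas by pairing two lists with `zip()`. This is a sketch with placeholder values ("Movie A" and so on are invented, not rows from the CSV file).

```python
# Placeholder data standing in for the Name and Year columns.
names = ["Movie A", "Movie B", "Movie C"]
years = [1995, 2003, 1998]

# zip() pairs the items position by position; dict() turns the pairs into key-value entries.
name_to_year = dict(zip(names, years))

print(name_to_year)              # {'Movie A': 1995, 'Movie B': 2003, 'Movie C': 1998}
print(name_to_year["Movie B"])   # 2003
```

As with `to_dict()`, a repeated name would keep only its last year, because dictionary keys are unique.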

In [None]:
df_rating = pd.read_csv('/content/drive/MyDrive/AI_Lecture/AI_Data_Scientist/dataset/Netflix_Dataset_Rating.csv')
print(df_rating.head())

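### ✅ Lesson Summary

- ✅ Python is a core language of data science
- ✅ You have learned variables, data types, and input/output
- ✅ You are familiar with the common data structures (list, dict, and so on)
- ✅ You can write simple conditionals and loops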


In [None]:
# movie data
movies = [
    {"1": "Dinosaur Planet"},
    {"2":"Isle of Man TT 2004 Review"}
]

# rating data
ratings = [
    {"User_ID": 712664, "Rating": 5, "Movie_ID": 3},
    {"User_ID": 1331154, "Rating": 4, "Movie_ID": 3},
    {"User_ID": 2632461, "Rating": 3, "Movie_ID": 3},
    {"User_ID": 44937, "Rating": 5, "Movie_ID": 3},
    {"User_ID": 656399, "Rating": 4, "Movie_ID": 3},
]

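### 📌 Homework Exercises

#### ✅ Q1: Print all movie names

Use a for loop to iterate over the movies list and print every movie's name (the value stored in each dictionary).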


In [None]:
movies[:10]

[{'1': 'Dinosaur Planet'}, {'2': 'Isle of Man TT 2004 Review'}]

In [None]:
type(movies[0])

dict

In [None]:
for movie in movies:
  for key, val in movie.items():
    print(val)

Dinosaur Planet
Isle of Man TT 2004 Review


#### ✅ Q2: Find the movies with 1990 < Year <= 2000

Write a program that finds and prints the names of all movies with 1990 < Year <= 2000.

In [None]:
movie_dict = df_movie.set_index('Name')['Year'].to_dict()
for name, year in movie_dict.items():
  if 1990 < year <= 2000:
    print(name)
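The same filter can also be written as a list comprehension, which collects the matching names instead of printing them one by one. The dictionary below holds placeholder values invented for illustration, not data from the CSV file.

```python
# Placeholder name -> year mapping.
name_to_year = {"Movie A": 1995, "Movie B": 2003, "Movie C": 2000, "Movie D": 1990}

# Keep the names whose year is after 1990 and no later than 2000.
nineties = [name for name, year in name_to_year.items() if 1990 < year <= 2000]

print(nineties)   # ['Movie A', 'Movie C']
```

"Movie D" is excluded because the lower bound is strict (1990 itself does not satisfy `1990 < year`), while "Movie C" is included because the upper bound is inclusive.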


#### ✅ Q3: Compute the average rating for movie ID 3

From ratings, extract every rating with Movie_ID == 3 and compute the average (rounded to 1 decimal place).

Hint:

In [None]:
df_rating.head(10)

In [None]:
ratings_id_3 = df_rating[df_rating['Movie_ID'] == 3]['Rating'].tolist()
ratings_id_3[:10]

In [None]:
total = 0
count = 0
for score in ratings_id_3:
    total += score
    count += 1
average = round(total / count, 1)
print("Average rating for Movie_ID == 3:", average)
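For comparison, here is a sketch of the same calculation on the small `ratings` list defined earlier in this lesson, using the built-ins `sum()` and `len()` instead of a manual counter. The list is repeated so the block runs on its own; the name `scores_for_3` is chosen for this example.

```python
ratings = [
    {"User_ID": 712664, "Rating": 5, "Movie_ID": 3},
    {"User_ID": 1331154, "Rating": 4, "Movie_ID": 3},
    {"User_ID": 2632461, "Rating": 3, "Movie_ID": 3},
    {"User_ID": 44937, "Rating": 5, "Movie_ID": 3},
    {"User_ID": 656399, "Rating": 4, "Movie_ID": 3},
]

# Collect the ratings that belong to movie 3.
scores_for_3 = [r["Rating"] for r in ratings if r["Movie_ID"] == 3]

# sum() / len() gives the mean; round(..., 1) keeps one decimal place.
average = round(sum(scores_for_3) / len(scores_for_3), 1)
print(average)   # 4.2
```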

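#### ✅ Q4: Add the movie name to each rating record (dictionary merging practice)

Using a for loop, add the movie's name (from movies) to each rating record and collect the results in a new list, ratings_with_name:

Example output: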

In [None]:
{
  'User_ID': 712664,
  'Rating': 5,
  'Movie_ID': 3,
  'Name': 'Character'
}

In [None]:
movie_dict = df_movie.set_index('Movie_ID')['Name'].to_dict()
movie_dict.get(3)

In [None]:
lst_all = []
for idx in range(len(df_rating)):
  row = df_rating.iloc[idx]  # one rating record
  movie_id = int(row['Movie_ID'])
  # build a new dictionary for this rating and look up the movie name in movie_dict
  dic_user_rating_movie = {
    'User_ID': int(row['User_ID']),
    'Rating': int(row['Rating']),
    'Movie_ID': movie_id,
    'Name': movie_dict[movie_id],
  }
  lst_all.append(dic_user_rating_movie)

  if idx > 5000:  # stop after roughly the first 5,000 rows
    break

In [None]:
lst_all[:10]
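The same merge can be sketched in plain Python on a couple of the sample rating records, without pandas. The lookup dictionary `movie_names` is an assumption made for this example: it maps movie ID 3 to 'Character', following the example output in the exercise.

```python
ratings = [
    {"User_ID": 712664, "Rating": 5, "Movie_ID": 3},
    {"User_ID": 1331154, "Rating": 4, "Movie_ID": 3},
]
movie_names = {3: "Character"}   # assumed ID -> name lookup for illustration

ratings_with_name = []
for rating in ratings:
    merged = dict(rating)   # copy, so the original record is not modified
    # .get() returns a fallback instead of raising KeyError for an unknown ID
    merged["Name"] = movie_names.get(rating["Movie_ID"], "Unknown")
    ratings_with_name.append(merged)

print(ratings_with_name[0])
```

Copying each record first keeps `ratings` unchanged, which makes it safe to rerun the cell.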

In [None]:
int('123')

#### ✅ Q5 (Challenge): Count how many people rated each movie

Output format:



```
Character: 4 people
Dinosaur Planet: 0 people
...
```


Hint:
- Count with a dictionary: {"Movie_Name": count}

In [None]:
import collections

In [None]:
dic_cnt = collections.defaultdict(int) # a dictionary whose missing keys default to int(), i.e. 0
dic_cnt

In [None]:
# dic_cnt has the form {key: value}; the value for a key that has not been seen yet is int() == 0

In [None]:
for record in lst_all:
  movie_name = record['Name']
  dic_cnt[movie_name] += 1 # a missing key starts at 0, so the first += 1 gives 1
dic_cnt
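An alternative to `defaultdict(int)` is `collections.Counter`, which counts items directly and returns 0 for anything it has not seen. The sketch below uses a small hypothetical list of rated movie names so that a movie with no ratings also appears in the output.

```python
from collections import Counter

# Hypothetical data: one name per rating, plus the full list of movie names.
rated_names = ["Character", "Character", "Character", "Character"]
all_names = ["Character", "Dinosaur Planet"]

counts = Counter(rated_names)   # counts how often each name occurs

for name in all_names:
    # Counter returns 0 for a name that never occurs, so no KeyError is raised.
    print(f"{name}: {counts[name]} people")
```

Looping over `all_names` rather than over `counts` is what lets movies with zero ratings show up.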