# 🧠 Lesson 01: Python Basics for Data Science

🎯 **Goal**: Master Python's basic syntax, variables, data structures, and control flow to build a foundation for later data processing and analysis.




## 📚 Course Outline
Colab

VS Code



### 1️⃣ Introduction to Python
#### 💻 What Is Python?

[Welcome to Python](https://www.python.org/)

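Python is a high-level, general-purpose, interpreted programming language, first released by Guido van Rossum in 1991. It is known for concise syntax, readability, and high development productivity, which makes it approachable for beginners while remaining capable of supporting large systems.

`✅ In one sentence: Python reads almost like plain English, so writing code stays close to how people think.`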

[tutorial](https://docs.python.org/3.15/tutorial/index.html)

R

#### 📊 Why Is Python Widely Used in Data Science?

Python has libraries with large collections of mathematical functions and analytical tools.

*   Pandas - used for structured data operations, such as importing CSV files, creating dataframes, and preparing data.
*   NumPy - a mathematical library with a powerful N-dimensional array object, linear algebra, Fourier transforms, etc.
*   Matplotlib - used for data visualization.
*   SciPy - a scientific computing library that includes linear algebra modules.

We will use these libraries throughout the tutorial to create examples.

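| Advantage | Description |
|------|------|
| 📚 Rich scientific computing libraries | NumPy (numerical computing), pandas (data processing), Matplotlib/Seaborn (visualization), scikit-learn (machine learning), and others make data processing efficient and straightforward. |
| 🧠 Strong AI / ML support | Deep learning and modeling frameworks such as TensorFlow, PyTorch, XGBoost, and LightGBM; nearly every mainstream AI tool offers a Python interface. |
| 🛠️ Concise, easy to learn, readable code | Compared with R or Java, Python is often considered easier to learn and maintain, which suits data analysts and researchers without a programming background. |
| 🌍 Strong community and ecosystem | From beginner tutorials and open-source projects to technical Q&A (such as StackOverflow), the Python data science community is very active, so solutions to common problems are easy to find. |
| 📈 Cross-platform and extensible | Python runs on Windows / Mac / Linux and can integrate with languages such as C/C++/Java, which makes it very flexible. |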

In [2]:
print("Hello, Data Science! 👋")

Hello, Data Science! 👋


In [1]:
print("Hello World!") # Function: print, parameter: "Hello World!"

Hello World!


### 2️⃣ Variables and Basic Data Types

	•	Common types: int, float, str, bool
	•	Type checking and conversion: type(), int(), float(), str()
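The conversion functions listed above can be sketched as follows. The variable names and raw values here are hypothetical, chosen only for illustration; the point is that text such as `"28"` has to be converted before it can be used in arithmetic.

```python
# Hypothetical raw values, e.g. as they might arrive from a text file or from input()
raw_age = "28"
raw_height = "1.65"

age = int(raw_age)             # str -> int
height = float(raw_height)     # str -> float
label = str(age) + " years"    # int -> str, needed before joining with other text

print(type(raw_age), type(age))           # <class 'str'> <class 'int'>
print(label)                              # 28 years
print(int(3.99))                          # 3 (int() truncates, it does not round)
print(bool(0), bool(""), bool("False"))   # False False True
```

Note that `bool("False")` is `True`: any non-empty string counts as true, regardless of its content.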

In [4]:
name = "Alice"
age = 28
height = 1.65
is_data_scientist = True

In [5]:
print(name, age, height, is_data_scientist)

Alice 28 1.65 True


In [6]:
print(type(height))

<class 'float'>


In [7]:
print(type(1))

<class 'int'>


### User Input Function (Data Collection)

In [8]:
user_name = input("Please enter your name: ")
print("Welcome to the world of data science,", user_name)

Please enter your name: May
Welcome to the world of data science, May


### 3️⃣ Common Data Structures

#### 🔹 List: the basic container for a dataset

In [9]:
cities = ["Tokyo", "New York", "London"]

In [10]:
cities.append("Shanghai") # append() adds an item to the end of the list

In [11]:
print(cities)

['Tokyo', 'New York', 'London', 'Shanghai']


In [12]:
cities[2]

'London'

In [15]:
len('Tokyo')

5

In [16]:
lst_test = [123, "123", 123.45, True]

In [17]:
lst_test[-1]

True

In [18]:
lst_test[-2]

123.45
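Beyond single-item indexing, lists also support slicing, length checks, membership tests, and iteration. The following is a minimal sketch that reuses the `cities` values from the cells above; the other variable names are illustrative assumptions.

```python
cities = ["Tokyo", "New York", "London", "Shanghai"]

first_two = cities[:2]             # slice: items at index 0 and 1
last_two = cities[-2:]             # slice: the final two items
n_cities = len(cities)             # number of items in the list
has_london = "London" in cities    # membership test

print(first_two)    # ['Tokyo', 'New York']
print(last_two)     # ['London', 'Shanghai']
print(n_cities)     # 4
print(has_london)   # True

# enumerate() yields (index, item) pairs while looping
for position, city in enumerate(cities):
    print(position, city)
```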

#### 🔹 Dict: a key-value structure, similar to a database

In [19]:
person = {"name": "Alice", "age": 28} # key: value (a value can be any type)

In [21]:
person["name"] = 'May' # update the value stored under an existing key

In [22]:
print(person["name"])

May

In [32]:
person

{'name': 'May', 'age': 28}

In [24]:
dic_test = {1: '1', '2': 2}

In [25]:
dic_test[1]

'1'

In [26]:
dic_test = {3.14: '1', 4.23: 2}

In [27]:
dic_test[3.14]

'1'

In [28]:
dic_test = {True: 1, False: 0}

In [29]:
dic_test[True]

1

In [30]:
dic_test = {"list": [1, 2, 3, 4, 5], "dict": {"key": "value", "key2": "value2"}}

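In [None]:
# Keys must be hashable (e.g. int, float, str, bool, tuple); values can be any type.
# A list is not hashable, so this line raises a TypeError.
dic_test_2 = {[1, 2, 3]: "test"}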
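A short sketch of why the cell above fails and one common workaround: a list cannot be a dictionary key because it is mutable and therefore unhashable, whereas a tuple with the same items can. The names `bad`, `ok`, and `error_message` are illustrative only.

```python
# Using a list as a key fails at dictionary creation time
try:
    bad = {[1, 2, 3]: "test"}
except TypeError as err:
    error_message = str(err)
    print("TypeError:", error_message)

# A tuple holding the same items is hashable, so it works as a key
ok = {(1, 2, 3): "test"}
print(ok[(1, 2, 3)])   # test
```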

In [35]:
person['height'] = 164 # assigning to a new key adds an item

In [37]:
person['age'] = 18 # assigning to an existing key updates its value

In [38]:
person

{'name': 'May', 'age': 18, 'height': 164}

#### 🔹 Tuple & Set

In [39]:
coordinates = (35.6, 139.6) # a tuple cannot be modified after creation

In [42]:
coordinates[1] = 123 # raises TypeError because tuples are immutable

TypeError: 'tuple' object does not support item assignment

In [43]:
unique_items = set([1, 2, 2, 3])  # a set removes duplicates automatically

In [44]:
unique_items

{1, 2, 3}
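Two everyday uses of these structures, sketched under the assumption that the tuple holds a latitude/longitude pair and that the sets hold release years (both are illustrative choices, not part of the lesson data):

```python
coordinates = (35.6, 139.6)
latitude, longitude = coordinates       # tuple unpacking: one name per item

seen_years = {2003, 2004, 1997}
other_years = {1997, 1994}

common = seen_years & other_years       # intersection: years in both sets
combined = seen_years | other_years     # union: years in either set

print(latitude, longitude)      # 35.6 139.6
print(common)                   # {1997}
print(sorted(combined))         # [1994, 1997, 2003, 2004]
print(2004 in seen_years)       # True
```

Sets are unordered, so `sorted()` is used when a predictable order is needed for display.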

### 4️⃣ Control Flow

#### 🔸 Conditionals: if / elif / else

In [45]:
score = 85

In [47]:
if score >= 90:
    print("Excellent")
elif 60 <= score < 90: # covers every passing score below 90
    print("Pass")
else:
    print("Fail")

Pass


Conditions can be combined with `and` (both must be true) or `or` (at least one must be true): condition 1 and condition 2 / condition 1 or condition 2.

In [49]:
height = 170
# height >= 170: high
# 160 <= height < 170: medium
# others: low

if height >= 170:
  print("high")
elif 160 <= height < 170: # same as height >= 160 and height < 170
  print("medium")
else:
  print("low")

high
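As a small sketch of how `and`, `or`, and `not` behave, the example below compares a chained comparison with its explicit `and` form. The variables `age` and `is_student` and the thresholds are hypothetical values chosen for illustration.

```python
age = 28
is_student = False

# A chained comparison and its explicit `and` form give the same result
in_range_chained = 18 <= age < 65
in_range_and = age >= 18 and age < 65

# `or` is true when at least one side is true
gets_discount = is_student or age >= 65

# `not` inverts a boolean value
pays_full_price = not gets_discount

print(in_range_chained, in_range_and)   # True True
print(gets_discount)                    # False
print(pays_full_price)                  # True
```

When writing range checks like these, it helps to test the boundary values themselves (here 18 and 65) to confirm each one lands in the intended branch.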


#### 🔸 Loops: for / while

In [50]:
for i in range(5): # i takes the values 0, 1, 2, 3, 4
    print("Data point", i)

Data point 0
Data point 1
Data point 2
Data point 3
Data point 4


In [51]:
n = 0
while n < 3: # the loop exits once n >= 3
    print("Iteration:", n)
    n += 1 # same as n = n + 1

Iteration: 0
Iteration: 1
Iteration: 2


In [None]:
# For each integer from 1 to 20 inclusive: if it is even (2, 4, 6, 8, 10, ...), compute its square (2 -> 2 * 2) and print it.
# If it is odd, compute its cube (3 -> 3 * 3 * 3) and print it.

In [63]:
1 % 2

1

In [59]:
# A number n is even when n % 2 == 0
i = 2

In [64]:
for i in range(1, 21): # i runs from 1 to 20 inclusive
  # even number: print the square
  if i % 2 == 0:
    print(i * i)
  # odd number: print the cube
  elif i % 2 == 1:
    print(i * i * i)

1
4
27
16
125
36
343
64
729
100
1331
144
2197
196
3375
256
4913
324
6859
400
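In data work the computed values are usually kept rather than only printed. The sketch below stores the same squares and cubes in a list, first with an explicit loop and then with an equivalent list comprehension; the names `results` and `results_compact` are illustrative.

```python
# Collect the values in a list instead of printing them one by one
results = []
for i in range(1, 21):
    if i % 2 == 0:
        results.append(i ** 2)   # even: square
    else:
        results.append(i ** 3)   # odd: cube

# The same logic written as a list comprehension
results_compact = [i ** 2 if i % 2 == 0 else i ** 3 for i in range(1, 21)]

print(results[:4])                  # [1, 4, 27, 16]
print(results == results_compact)   # True
```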


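### 5️⃣ In-Class Exercises

#### ✏️ Exercise 1: Variables

In [65]:
name = input("Your name: ")
city = input("The data city you most want to go to: ")
print(f"{name} wants to become a data scientist working in {city}!")

Your name: May
The data city you most want to go to: Hawaii
May wants to become a data scientist working in Hawaii!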


#### ✏️ Exercise 2: BMI Calculator

In [66]:
def calculate_bmi(weight, height): # weight in kg, height in m
    return weight / (height ** 2)

bmi = calculate_bmi(60, 1.65)
print("Your BMI is:", bmi)

Your BMI is: 22.03856749311295
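The BMI function combines naturally with the if / elif / else pattern from the control-flow section. The sketch below adds a hypothetical `bmi_category` helper; the cut-offs (18.5, 25, 30) follow the commonly cited WHO adult ranges and are an assumption added here, not part of the original exercise.

```python
def calculate_bmi(weight_kg, height_m):
    """Return body mass index: weight in kg divided by height in m squared."""
    return weight_kg / (height_m ** 2)


def bmi_category(bmi):
    """Map a BMI value to a label (assumed cut-offs: 18.5, 25, 30)."""
    if bmi < 18.5:
        return "underweight"
    elif bmi < 25:
        return "normal"
    elif bmi < 30:
        return "overweight"
    else:
        return "obese"


bmi = calculate_bmi(60, 1.65)
print(round(bmi, 1), bmi_category(bmi))   # 22.0 normal
```

Because each `elif` is only reached when the earlier conditions were false, the lower bound of every range does not need to be repeated.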


In [67]:
import pandas as pd

df_movie = pd.read_csv('/content/drive/MyDrive/AI_Lecture/AI_Data_Scientist/dataset/Netflix_Dataset_Movie.csv')
print(df_movie.head())

   Movie_ID  Year                          Name
0         1  2003               Dinosaur Planet
1         2  2004    Isle of Man TT 2004 Review
2         3  1997                     Character
3         4  1994  Paula Abdul's Get Up & Dance
4         5  2004      The Rise and Fall of ECW


In [68]:
# convert df_movie['Year'] to list
lst_year = df_movie['Year'].tolist()
lst_year

[2003,
 2004,
 1997,
 1994,
 2004,
 1997,
 1992,
 2004,
 1991,
 2001,
 1999,
 1947,
 2003,
 1982,
 1988,
 1996,
 2005,
 1994,
 2000,
 1972,
 2002,
 2000,
 2001,
 1981,
 1997,
 2004,
 1962,
 2002,
 2001,
 2003,
 1999,
 2004,
 2000,
 2003,
 2000,
 1992,
 1973,
 2003,
 2000,
 2004,
 2000,
 2002,
 2000,
 1996,
 1999,
 1964,
 1952,
 2001,
 2003,
 1941,
 2002,
 2002,
 2003,
 1952,
 1995,
 2004,
 1995,
 1996,
 2003,
 1969,
 1999,
 1991,
 1943,
 2001,
 2000,
 1989,
 1997,
 2004,
 2003,
 1999,
 1995,
 1974,
 1954,
 1999,
 1997,
 1952,
 1995,
 1996,
 1956,
 1979,
 1991,
 1951,
 1983,
 2002,
 2005,
 1996,
 2002,
 1998,
 2000,
 1951,
 2005,
 2002,
 2004,
 2000,
 1985,
 2000,
 2002,
 1965,
 1989,
 1993,
 1997,
 2004,
 1976,
 1965,
 2002,
 2004,
 2000,
 2004,
 1996,
 1989,
 2003,
 1993,
 2000,
 1989,
 1973,
 2004,
 1957,
 1985,
 1999,
 2004,
 2003,
 2002,
 2000,
 2000,
 1981,
 2003,
 1987,
 1985,
 2003,
 1999,
 2002,
 1981,
 2003,
 1996,
 1998,
 1927,
 1998,
 1995,
 2001,
 1993,
 2004,
 2000,
 1997,

In [69]:
lst_movie = list(df_movie['Year'])
lst_movie

[2003,
 2004,
 1997,
 1994,
 2004,
 1997,
 1992,
 2004,
 1991,
 2001,
 1999,
 1947,
 2003,
 1982,
 1988,
 1996,
 2005,
 1994,
 2000,
 1972,
 2002,
 2000,
 2001,
 1981,
 1997,
 2004,
 1962,
 2002,
 2001,
 2003,
 1999,
 2004,
 2000,
 2003,
 2000,
 1992,
 1973,
 2003,
 2000,
 2004,
 2000,
 2002,
 2000,
 1996,
 1999,
 1964,
 1952,
 2001,
 2003,
 1941,
 2002,
 2002,
 2003,
 1952,
 1995,
 2004,
 1995,
 1996,
 2003,
 1969,
 1999,
 1991,
 1943,
 2001,
 2000,
 1989,
 1997,
 2004,
 2003,
 1999,
 1995,
 1974,
 1954,
 1999,
 1997,
 1952,
 1995,
 1996,
 1956,
 1979,
 1991,
 1951,
 1983,
 2002,
 2005,
 1996,
 2002,
 1998,
 2000,
 1951,
 2005,
 2002,
 2004,
 2000,
 1985,
 2000,
 2002,
 1965,
 1989,
 1993,
 1997,
 2004,
 1976,
 1965,
 2002,
 2004,
 2000,
 2004,
 1996,
 1989,
 2003,
 1993,
 2000,
 1989,
 1973,
 2004,
 1957,
 1985,
 1999,
 2004,
 2003,
 2002,
 2000,
 2000,
 1981,
 2003,
 1987,
 1985,
 2003,
 1999,
 2002,
 1981,
 2003,
 1996,
 1998,
 1927,
 1998,
 1995,
 2001,
 1993,
 2004,
 2000,
 1997,

In [70]:
type(lst_movie)

list

In [71]:
# build a dictionary that maps each df_movie['Name'] to its df_movie['Year']
movie_dict = df_movie.set_index('Name')['Year'].to_dict()
movie_dict

{'Dinosaur Planet': 2003,
 'Isle of Man TT 2004 Review': 2004,
 'Character': 1997,
 "Paula Abdul's Get Up & Dance": 1994,
 'The Rise and Fall of ECW': 2004,
 'Sick': 1997,
 '8 Man': 1992,
 'What the #$*! Do We Know!?': 2004,
 "Class of Nuke 'Em High 2": 1991,
 'Fighter': 2001,
 'Full Frame: Documentary Shorts': 1999,
 'My Favorite Brunette': 1947,
 'Lord of the Rings: The Return of the King: Extended Edition: Bonus Material': 2003,
 'Nature: Antarctica': 1982,
 'Neil Diamond: Greatest Hits Live': 1988,
 'Screamers': 1996,
 '7 Seconds': 2005,
 'Immortal Beloved': 1994,
 "By Dawn's Early Light": 1990,
 'Seeta Aur Geeta': 1972,
 'Strange Relations': 2002,
 'Chump Change': 2000,
 "Clifford: Clifford Saves the Day! / Clifford's Fluffiest Friend Cleo": 2001,
 'My Bloody Valentine': 1981,
 'Inspector Morse 31: Death Is Now My Neighbour': 1997,
 'Never Die Alone': 2004,
 "Sesame Street: Elmo's World: The Street We Live On": 1962,
 'Lilo and Stitch': 2002,
 'Boycott': 2001,
 "Something's Gotta 

In [72]:
df_rating = pd.read_csv('/content/drive/MyDrive/AI_Lecture/AI_Data_Scientist/dataset/Netflix_Dataset_Rating.csv')
print(df_rating.head())

   User_ID  Rating  Movie_ID
0   712664       5         3
1  1331154       4         3
2  2632461       3         3
3    44937       5         3
4   656399       4         3


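### ✅ Lesson Summary

	•	✅ Python is a core language for data science
	•	✅ Work with variables, data types, and input/output
	•	✅ Know the common data structures (list, dict, etc.)
	•	✅ Write simple conditionals and loops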


In [None]:
# Movie data
movies = [
    {"Movie_ID": 1, "Year": 2003, "Name": "Dinosaur Planet"},
    {"Movie_ID": 2, "Year": 2004, "Name": "Isle of Man TT 2004 Review"},
    {"Movie_ID": 3, "Year": 1997, "Name": "Character"},
    {"Movie_ID": 4, "Year": 1994, "Name": "Paula Abdul's Get Up & Dance"},
    {"Movie_ID": 5, "Year": 2004, "Name": "The Rise and Fall of ECW"},
]

# Rating data
ratings = [
    {"User_ID": 712664, "Rating": 5, "Movie_ID": 3},
    {"User_ID": 1331154, "Rating": 4, "Movie_ID": 3},
    {"User_ID": 2632461, "Rating": 3, "Movie_ID": 3},
    {"User_ID": 44937, "Rating": 5, "Movie_ID": 3},
    {"User_ID": 656399, "Rating": 4, "Movie_ID": 3},
]

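### 📌 Homework Exercises

#### ✅ Q1: Print all movie names

Use a for loop to iterate over the movies list and print the name of every movie (the Name field).


#### ✅ Q2: Find movies with 1990 < Year <= 2000

Write a program that finds and prints the names of all movies with 1990 < Year <= 2000.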
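One possible approach to the year filter, sketched with a copy of the sample `movies` list so the block runs on its own. The result list name `names_in_range` is an illustrative choice, and other solutions are equally valid.

```python
movies = [
    {"Movie_ID": 1, "Year": 2003, "Name": "Dinosaur Planet"},
    {"Movie_ID": 2, "Year": 2004, "Name": "Isle of Man TT 2004 Review"},
    {"Movie_ID": 3, "Year": 1997, "Name": "Character"},
    {"Movie_ID": 4, "Year": 1994, "Name": "Paula Abdul's Get Up & Dance"},
    {"Movie_ID": 5, "Year": 2004, "Name": "The Rise and Fall of ECW"},
]

names_in_range = []
for movie in movies:
    # keep the movie only when its year is after 1990 and no later than 2000
    if 1990 < movie["Year"] <= 2000:
        names_in_range.append(movie["Name"])

print(names_in_range)   # ['Character', "Paula Abdul's Get Up & Dance"]
```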


#### ✅ Q3: Compute the average rating of the movie with ID 3

From ratings, extract every rating with Movie_ID == 3 and compute the average (rounded to 1 decimal place).

Hint:

In [None]:
total = 0
count = 0
for r in ratings:
    if r["Movie_ID"] == 3:
        total += r["Rating"]
        count += 1
average = round(total / count, 1)

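#### ✅ Q4: Add the movie name to each rating record (dictionary merge practice)

Use a for loop to add the matching movie name (from movies) to each rating record, and collect the results in a new list, ratings_with_name:

Example output: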

In [None]:
{
  'User_ID': 712664,
  'Rating': 5,
  'Movie_ID': 3,
  'Name': 'Character'
}
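A minimal sketch of one way to produce records of this shape. It uses shortened copies of the sample data so the block is self-contained; the lookup dictionary `name_by_id` and the `"Unknown"` fallback are assumptions added for illustration.

```python
movies = [
    {"Movie_ID": 1, "Year": 2003, "Name": "Dinosaur Planet"},
    {"Movie_ID": 3, "Year": 1997, "Name": "Character"},
]
ratings = [
    {"User_ID": 712664, "Rating": 5, "Movie_ID": 3},
    {"User_ID": 1331154, "Rating": 4, "Movie_ID": 3},
]

# Step 1: build a lookup table from Movie_ID to Name
name_by_id = {}
for movie in movies:
    name_by_id[movie["Movie_ID"]] = movie["Name"]

# Step 2: copy each rating and attach the matching name
ratings_with_name = []
for rating in ratings:
    merged = dict(rating)   # copy, so the original record is left unchanged
    merged["Name"] = name_by_id.get(rating["Movie_ID"], "Unknown")
    ratings_with_name.append(merged)

print(ratings_with_name[0])
# {'User_ID': 712664, 'Rating': 5, 'Movie_ID': 3, 'Name': 'Character'}
```

Building the lookup dictionary first avoids scanning the whole `movies` list once per rating.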

#### ✅ Q5 (Challenge): Count how many users rated each movie

Output format:



```
Character: 5 users
Dinosaur Planet: 0 users
...
```


Hint:
	•	Count with a dictionary: {"Movie_Name": count}
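The dictionary-counting hint could be sketched as follows, again with a copy of the sample data so the block runs on its own. The names `id_to_name` and `rating_counts` are illustrative; the printed lines show only the name and the count, so any unit suffix can be appended as needed.

```python
movies = [
    {"Movie_ID": 1, "Year": 2003, "Name": "Dinosaur Planet"},
    {"Movie_ID": 3, "Year": 1997, "Name": "Character"},
]
ratings = [
    {"User_ID": 712664, "Rating": 5, "Movie_ID": 3},
    {"User_ID": 1331154, "Rating": 4, "Movie_ID": 3},
    {"User_ID": 2632461, "Rating": 3, "Movie_ID": 3},
    {"User_ID": 44937, "Rating": 5, "Movie_ID": 3},
    {"User_ID": 656399, "Rating": 4, "Movie_ID": 3},
]

# Start every movie at 0 so that unrated movies still appear in the result
id_to_name = {}
rating_counts = {}
for movie in movies:
    id_to_name[movie["Movie_ID"]] = movie["Name"]
    rating_counts[movie["Name"]] = 0

# Add 1 for each rating whose Movie_ID matches a known movie
for rating in ratings:
    name = id_to_name.get(rating["Movie_ID"])
    if name is not None:
        rating_counts[name] += 1

for name, count in rating_counts.items():
    print(f"{name}: {count}")
```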