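# Build a Web App to Use an ML Model

1. Lesson link: https://github.com/microsoft/ML-For-Beginners/blob/main/3-Web-App/1-Web-App/README.md
2. Dataset link: https://www.kaggle.com/datasets/NUFORC/ufo-sightings/data
3. Learning objective: save a trained model and use it for prediction in a web app
4. About the dataset
    - The dataset contains more than 80,000 UFO sighting reports from the past century. Because the reports date back to the 20th century, some of the older records may be unclear or incomplete. The fields include city, state, time, description, and the duration of each sighting.

    - Two versions: scrubbed and complete
        - The complete data includes problem records, for example sightings whose location was not found or is blank (0.8146%) and records whose time is erroneous or missing (8.0237%)

    - Inspiration
        - Which areas are most likely to have UFO sightings?
        - Do UFO sightings show any trend over time? Do they cluster or vary by season?
        - Do clusters of sightings correlate with landmarks such as airports or government research centers?
        - What are the most common UFO descriptions?
5. Dependencies

        pip3 install pandas numpy scikit-learn tensorflow flask joblib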
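
Note that pip package names and import names are not always the same: `scikit-learn` is imported as `sklearn`. One quick way to confirm the environment is ready is to probe each import name without actually importing it. The sketch below assumes the six packages from the install command; the helper name `find_missing` is hypothetical.

```python
from importlib.util import find_spec

# Import names for the packages installed above.
# Note: the pip package "scikit-learn" is imported as "sklearn".
REQUIRED_MODULES = ["pandas", "numpy", "sklearn", "tensorflow", "flask", "joblib"]


def find_missing(modules):
    """Return the module names that cannot be located in this environment."""
    return [name for name in modules if find_spec(name) is None]


missing = find_missing(REQUIRED_MODULES)
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All required modules are available.")
```

`find_spec` only locates a module, so the check stays fast even for large packages such as TensorFlow.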

In [34]:

import pandas as pd

df = pd.read_csv("/Users/yfan/Downloads/archive/scrubbed.csv", low_memory=False)

print("Columns:", df.columns.tolist())
df = df.dropna(subset=['duration (seconds)', 'latitude', 'longitude ', 'shape'])
df.head(5)


Columns: ['datetime', 'city', 'state', 'country', 'shape', 'duration (seconds)', 'duration (hours/min)', 'comments', 'date posted', 'latitude', 'longitude ']


|  | datetime | city | state | country | shape | duration (seconds) | duration (hours/min) | comments | date posted | latitude | longitude |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 10/10/1949 20:30 | san marcos | tx | us | cylinder | 2700 | 45 minutes | This event took place in early fall around 194... | 4/27/2004 | 29.8830556 | -97.941111 |
| 1 | 10/10/1949 21:00 | lackland afb | tx |  | light | 7200 | 1-2 hrs | 1949 Lackland AFB&#44 TX. Lights racing acros... | 12/16/2005 | 29.38421 | -98.581082 |
| 2 | 10/10/1955 17:00 | chester (uk/england) |  | gb | circle | 20 | 20 seconds | Green/Orange circular disc over Chester&#44 En... | 1/21/2008 | 53.2 | -2.916667 |
| 3 | 10/10/1956 21:00 | edna | tx | us | circle | 20 | 1/2 hour | My older brother and twin sister were leaving ... | 1/17/2004 | 28.9783333 | -96.645833 |
| 4 | 10/10/1960 20:00 | kaneohe | hi | us | light | 900 | 15 minutes | AS a Marine 1st Lt. flying an FJ4B fighter/att... | 1/22/2004 | 21.4180556 | -157.803611 |
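
The `Columns:` output above shows that the longitude header is `'longitude '`, with a trailing space, so `df['longitude']` raises a `KeyError` unless the names are normalized first. In pandas, `df.columns = df.columns.str.strip()` does this. The sketch below shows the same idea with the standard library on a made-up two-line sample (not real rows from the dataset), so it runs without the CSV file; the helper name is hypothetical.

```python
import csv
import io

# Hypothetical two-line sample that mimics the trailing space in the real header.
SAMPLE = "latitude,longitude \n29.88,-97.94\n"


def read_rows_with_clean_headers(text):
    """Parse CSV text and strip surrounding whitespace from every header name."""
    reader = csv.reader(io.StringIO(text))
    headers = [name.strip() for name in next(reader)]
    return [dict(zip(headers, row)) for row in reader]


rows = read_rows_with_clean_headers(SAMPLE)
print(rows[0]["longitude"])  # the key no longer needs the trailing space
```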


In [31]:
import pandas as pd
import numpy as np
import tensorflow as tf
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler
import pickle

def train_model():
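    print("Loading and cleaning data...")
    # 1. Load the data
    df = pd.read_csv('/Users/yfan/Downloads/archive/scrubbed.csv', low_memory=False)
    # The raw longitude header carries a trailing space ('longitude '),
    # so normalize all column names before referring to them.
    df.columns = df.columns.str.strip()

    # 2. Aggressively clean the numeric columns
    #    (avoids "ValueError: could not convert string to float")
    def clean_col(col_name):
        # Drop every character except digits, the decimal point, and the minus sign;
        # anything that still cannot be parsed becomes NaN.
        cleaned = df[col_name].astype(str).str.replace(r'[^0-9.-]', '', regex=True)
        return pd.to_numeric(cleaned, errors='coerce')

    df['duration (seconds)'] = clean_col('duration (seconds)')
    df['latitude'] = clean_col('latitude')
    df['longitude'] = clean_col('longitude')

    # 3. Drop rows with missing features or labels
    df = df.dropna(subset=['duration (seconds)', 'latitude', 'longitude', 'shape'])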

    # Keep only shapes with more than 100 reports to make training more stable
    shape_counts = df['shape'].value_counts()
    df = df[df['shape'].isin(shape_counts[shape_counts > 100].index)]

    X = df[['duration (seconds)', 'latitude', 'longitude']].values
    y = df['shape'].values

    # 4. Encode the labels and standardize the features
    le = LabelEncoder()
    y_encoded = le.fit_transform(y)

    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X)

    X_train, X_test, y_train, y_test = train_test_split(X_scaled, y_encoded, test_size=0.2, random_state=42)

    # 5. Build the neural network
    model = tf.keras.Sequential([
        # Declare the input explicitly; passing input_shape to the first Dense
        # layer triggers a warning in recent Keras versions.
        tf.keras.Input(shape=(X_train.shape[1],)),
        tf.keras.layers.Dense(128, activation='relu'),
        tf.keras.layers.BatchNormalization(),
        tf.keras.layers.Dropout(0.3),
        tf.keras.layers.Dense(64, activation='relu'),
        tf.keras.layers.Dense(len(le.classes_), activation='softmax')
    ])

    model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])

    # 6. Train the model
    print("Starting training...")
    model.fit(X_train, y_train, epochs=15, batch_size=64, validation_split=0.1, verbose=1)

    # 7. Save every component needed at prediction time
    model.save('ufo_model.h5')
    with open('scaler.pkl', 'wb') as f:
        pickle.dump(scaler, f)
    with open('label_encoder.pkl', 'wb') as f:
        pickle.dump(le, f)

    print("\nTraining complete! Generated: ufo_model.h5, scaler.pkl, label_encoder.pkl")

if __name__ == "__main__":
    train_model()

Loading and cleaning data...
Starting training...
Epoch 1/15
882/882 ━━━━━━━━━━━━━━━━━━━━ 1s 447us/step - accuracy: 0.1976 - loss: 2.7037 - val_accuracy: 0.2126 - val_loss: 2.6418
Epoch 2/15
882/882 ━━━━━━━━━━━━━━━━━━━━ 0s 396us/step - accuracy: 0.2119 - loss: 2.6388 - val_accuracy: 0.2148 - val_loss: 2.6439
Epoch 3/15
882/882 ━━━━━━━━━━━━━━━━━━━━ 0s 379us/step - accuracy: 0.2130 - loss: 2.6281 - val_accuracy: 0.2150 - val_loss: 2.6403
Epoch 4/15
882/882 ━━━━━━━━━━━━━━━━━━━━ 0s 396us/step - accuracy: 0.2133 - loss: 2.6239 - val_accuracy: 0.2151 - val_loss: 2.6293
Epoch 5/15
882/882 ━━━━━━━━━━━━━━━━━━━━ 0s 386us/step - accuracy: 0.2134 - loss: 2.6213 - val_accuracy: 0.2153 - val_loss: 2.6299
Epoch 6/15
882/882 ━━━━━━━━━━━━━━━━━━━━ 0s 389us/step - accuracy: 0.2134 - loss: 2.6204 - val_accuracy: 0.2153 - val_loss: 2.6292
Epoch 7/15
882/882

Training complete! Generated: ufo_model.h5, scaler.pkl, label_encoder.pkl
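
With the three artifacts saved, the second half of the learning objective is to load them in a web app and predict. The following is a minimal sketch, not the lesson's implementation: the `/predict` route, the JSON field names (`seconds`, `latitude`, `longitude`), and the helpers `parse_features` and `create_app` are all assumptions made for illustration. The heavy imports are deferred into `create_app` so the input-parsing helper can be checked without Flask or TensorFlow installed.

```python
import pickle

# Hypothetical JSON field names, in the same order as the training features:
# duration (seconds), latitude, longitude.
FEATURE_FIELDS = ("seconds", "latitude", "longitude")


def parse_features(payload):
    """Convert a JSON-like dict into a list of floats in training order.

    Raises ValueError if a field is missing or not numeric.
    """
    try:
        return [float(payload[name]) for name in FEATURE_FIELDS]
    except (KeyError, TypeError, ValueError) as exc:
        raise ValueError(f"invalid input: {exc}") from exc


def create_app(model_path="ufo_model.h5",
               scaler_path="scaler.pkl",
               encoder_path="label_encoder.pkl"):
    """Build a Flask app that serves predictions from the saved artifacts."""
    # Imported here so that parse_features stays usable without these packages.
    import numpy as np
    import tensorflow as tf
    from flask import Flask, jsonify, request

    model = tf.keras.models.load_model(model_path)
    with open(scaler_path, "rb") as f:
        scaler = pickle.load(f)
    with open(encoder_path, "rb") as f:
        label_encoder = pickle.load(f)

    app = Flask(__name__)

    @app.route("/predict", methods=["POST"])
    def predict():
        try:
            features = parse_features(request.get_json(silent=True) or {})
        except ValueError as err:
            return jsonify({"error": str(err)}), 400

        # Apply the same scaling used in training, then pick the most likely class.
        scaled = scaler.transform(np.array([features]))
        probabilities = model.predict(scaled, verbose=0)[0]
        best = int(np.argmax(probabilities))
        shape = label_encoder.inverse_transform([best])[0]
        return jsonify({"shape": str(shape),
                        "probability": float(probabilities[best])})

    return app
```

To try it locally, call `create_app().run(debug=True)` from a script in the directory that holds the three files, then POST JSON such as `{"seconds": 20, "latitude": 53.2, "longitude": -2.916667}` to `/predict`. Unpickle only files you produced yourself, since `pickle.load` can execute arbitrary code.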
