### Portfolio Notebook
Cheng Zhong <br>
cheng.zhong@columbia.edu

This notebook summarizes three machine learning projects I completed in Spring 2021 for the Advanced Machine Learning class at Columbia University. The projects analyze real-world problems with deep learning and neural network algorithms. They cover three types of data: tabular, image, and text. In each project, I applied supervised machine learning models to select key features, identify patterns, and generate evaluation metrics.

### Project 1: Predict World Happiness Rankings

**Link to project:** https://github.com/chengzhong666/Advanced-Machine-Learning-Portfolio/blob/main/Project%201%20-%20Predict%20World%20Happiness%20Rankings/Predicting%20Happiness-Cheng%20Zhong.ipynb

**Research questions:**
- What makes the citizens of one country happier than the citizens of other countries?
- Are variables measuring perceptions of corruption, GDP, maintaining a healthy lifestyle, or social support associated with a country's happiness ranking?

**Data source:** 2019 World Happiness Survey Rankings (https://worldhappiness.report/)

**Data type:** Tabular data

**Features:**
*   Country or region
*   GDP per capita
*   Social support
*   Healthy life expectancy
*   Freedom to make life choices
*   Generosity
*   Perceptions of corruption

**Target:**
*   Happiness_level (Very High = Top 20% and Very Low = Bottom 20%)
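
A five-level target like this can be derived by ranking countries on a numeric happiness score and cutting the ranks into quintiles. The sketch below is a minimal illustration under that assumption. The helper `quintile_labels`, the three middle label names, and the sample scores are hypothetical, since only the top and bottom levels are named here.

```python
def quintile_labels(scores):
    """Assign each score one of five labels according to its rank-based quintile."""
    # Only "Very Low" and "Very High" come from the notebook; the middle names are assumed.
    labels = ["Very Low", "Low", "Average", "High", "Very High"]
    n = len(scores)
    # Positions of the scores, ordered from lowest to highest score
    order = sorted(range(n), key=lambda i: scores[i])
    result = [None] * n
    for rank, idx in enumerate(order):
        # rank * 5 // n maps ranks 0..n-1 onto quintile indices 0..4
        result[idx] = labels[rank * 5 // n]
    return result


# Made-up scores for ten hypothetical countries
scores = [7.8, 3.1, 5.5, 6.2, 4.4, 7.1, 2.9, 5.0, 6.8, 3.9]
happiness_level = quintile_labels(scores)
print(happiness_level)
```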

**Summary:**

To understand how the features listed above influence the happiness level, I experimented with logistic regression, penalized logistic regression, k-nearest neighbors, random forest, and Keras neural network models to identify the most impactful features. Because the dataset has one row per country, it contains only 156 observations. The training set is smaller still after the train/test split, which may hurt model training. After scaling the data, I used grid search cross-validation to tune the hyperparameters of each model: for instance, C for the logistic regressions, the number of neighbors k for the KNN model, and the number of nodes and the learning rate for the Keras models.

My best-performing model is a Keras model with an accuracy of 0.4615 and a loss of 1.2060. The structure of the model is listed below:
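
The scale-then-tune workflow described above can be sketched with scikit-learn. This is an illustration only: it uses synthetic data from `make_classification` as a stand-in for the happiness data, and the grids of `C` and `n_neighbors` values are assumptions, not the grids used in the project.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: 156 rows and 5 classes, mirroring the size of the country-level data
X, y = make_classification(
    n_samples=156, n_features=6, n_informative=4, n_redundant=0,
    n_classes=5, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# One grid search per model; the parameter grids are hypothetical
searches = {
    "logistic_regression": GridSearchCV(
        Pipeline([("scale", StandardScaler()),
                  ("clf", LogisticRegression(max_iter=1000))]),
        param_grid={"clf__C": [0.01, 0.1, 1, 10]},
        cv=5,
    ),
    "knn": GridSearchCV(
        Pipeline([("scale", StandardScaler()),
                  ("clf", KNeighborsClassifier())]),
        param_grid={"clf__n_neighbors": [3, 5, 7, 9]},
        cv=5,
    ),
}

test_scores = {}
for name, search in searches.items():
    search.fit(X_train, y_train)  # 5-fold cross-validation over the grid
    test_scores[name] = search.score(X_test, y_test)  # accuracy on held-out data
    print(name, search.best_params_, round(test_scores[name], 3))
```

Placing the scaler inside the pipeline means it is refit on the training portion of every fold, so no information from the validation fold leaks into the scaling step.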

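In [2]:
# scikit-learn wrapper for Keras models; this module is deprecated and absent from recent TensorFlow releases
from tensorflow.keras.wrappers.scikit_learn import KerasClassifier
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Activation, BatchNormalization

In [4]:
# Three hidden layers of 64 ReLU units on 11 input columns
model = Sequential()
model.add(Dense(64, input_dim=11, activation='relu'))
model.add(Dense(64, activation='relu'))
model.add(Dense(64, activation='relu'))
# Softmax output over the five happiness levels
model.add(Dense(5, activation='softmax'))
# Categorical cross-entropy expects one-hot encoded labels
model.compile(loss='categorical_crossentropy', optimizer='sgd', metrics=['accuracy'])
model.summary()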

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_4 (Dense)              (None, 64)                768       
_________________________________________________________________
dense_5 (Dense)              (None, 64)                4160      
_________________________________________________________________
dense_6 (Dense)              (None, 64)                4160      
_________________________________________________________________
dense_7 (Dense)              (None, 5)                 325       
Total params: 9,413
Trainable params: 9,413
Non-trainable params: 0
_________________________________________________________________
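
The parameter counts in the summary above can be checked by hand: a dense layer has one weight per input-unit pair plus one bias per unit. The helper `dense_params` below is a hypothetical name used only for this check.

```python
def dense_params(n_inputs, n_units):
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units


# Input width followed by the width of each Dense layer in the model above
layer_sizes = [11, 64, 64, 64, 5]
per_layer = [dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]
total = sum(per_layer)
print(per_layer, total)
```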


### Project 2: Predict Covid Positivity from X-Ray Image

**Link to project:** https://github.com/chengzhong666/Advanced-Machine-Learning-Portfolio/blob/main/Project%202%20-%20Predict%20Covid%20Positivity%20from%20X-Ray%20Image/Covid.ipynb

**Research goal:** Classify X-ray images into three categories: 1) Covid pneumonia, 2) non-covid pneumonia, and 3) normal.


**Data source:** M.E.H. Chowdhury, T. Rahman, A. Khandakar, R. Mazhar, M.A. Kadir, Z.B. Mahbub, K.R. Islam, M.S. Khan, A. Iqbal, N. Al-Emadi, M.B.I. Reaz, “Can AI help in screening Viral and COVID-19 pneumonia?” arXiv preprint, 29 March 2020, https://arxiv.org/abs/2003.13145

**Data type:** Image data

**Summary:**

Since the beginning of 2020, the coronavirus disease (COVID-19) has seriously endangered people's health and lives, so it is important to correctly identify infected patients. Chest X-ray images are an important source of evidence for diagnosing positive cases. Applying convolutional neural networks and transfer learning could improve diagnostic accuracy and help medical workers, scientists, and policy-makers sharpen their analytical judgments and optimize their decisions.

After preprocessing the image data, I experimented with three models:
- Model 1: Keras convolutional neural network (accuracy: 0.9460)
- Model 2: Keras convolutional neural network (accuracy: 0.9447)
  - second convolutional layer with kernel size = 3
  - ReduceLROnPlateau callback added to avoid overfitting
- Model 3: Transfer learning with the VGG16 model (accuracy: 0.9523)
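
The learning-rate schedule used in Model 2 lowers the rate when a monitored metric stops improving. The sketch below re-implements that idea in plain Python for illustration only. It is not the Keras `ReduceLROnPlateau` code, and the function name `reduce_lr_on_plateau`, the `factor` and `patience` values, and the sample validation losses are all assumptions.

```python
def reduce_lr_on_plateau(val_losses, lr=0.01, factor=0.5, patience=2, min_lr=1e-5):
    """Return the learning rate in effect after each epoch."""
    best = float("inf")
    wait = 0  # epochs since the last improvement
    history = []
    for loss in val_losses:
        if loss < best:
            # Improvement: remember it and reset the counter
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                # No improvement for `patience` epochs: shrink the rate
                lr = max(lr * factor, min_lr)
                wait = 0
        history.append(lr)
    return history


# Made-up validation losses that stall twice
val_losses = [0.90, 0.70, 0.65, 0.66, 0.67, 0.64, 0.65, 0.66]
lrs = reduce_lr_on_plateau(val_losses)
print(lrs)
```

In Keras, comparable behavior comes from passing a `tensorflow.keras.callbacks.ReduceLROnPlateau` instance in the `callbacks` argument of `model.fit`.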

In [7]:
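from tensorflow.keras.applications import VGG16
from tensorflow.keras.layers import Dense, Flatten
from tensorflow.keras.models import Model

# Load the VGG16 convolutional base with ImageNet weights, without its classifier head
base_model = VGG16(input_shape=(192,192,3), include_top=False, weights='imagenet')
# Freeze the pretrained layers so that only the new head is trained
base_model.trainable = False
# New classification head: flatten, one hidden layer, three-way softmax
flat = Flatten()(base_model.layers[-1].output)
class_ = Dense(1024, activation='relu')(flat)
output = Dense(3, activation='softmax')(class_)
model = Model(inputs=base_model.inputs, outputs=output)
model.summary()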

Model: "functional_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_3 (InputLayer)         [(None, 192, 192, 3)]     0         
_________________________________________________________________
block1_conv1 (Conv2D)        (None, 192, 192, 64)      1792      
_________________________________________________________________
block1_conv2 (Conv2D)        (None, 192, 192, 64)      36928     
_________________________________________________________________
block1_pool (MaxPooling2D)   (None, 96, 96, 64)        0         
_________________________________________________________________
block2_conv1 (Conv2D)        (None, 96, 96, 128)       73856     
_________________________________________________________________
block2_conv2 (Conv2D)        (None, 96, 96, 128)       147584    
_________________________________________________________________
block2_pool (MaxPooling2D)   (None, 48, 48, 128)       0
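
The shapes and parameter counts in the summary above can be checked by hand. A 3x3 convolution has `3 * 3 * in_channels * out_channels` weights plus one bias per output channel, and each 2x2 pooling layer halves the spatial size. The sketch below applies these rules. The helper name `conv_params` is hypothetical, and the head calculation relies on VGG16 having five pooling stages and 512 channels in its last block.

```python
def conv_params(kernel, in_channels, out_channels):
    """Weights of a square convolution plus one bias per output channel."""
    return kernel * kernel * in_channels * out_channels + out_channels


# First two VGG16 blocks, matching the rows printed in the summary
block1 = [conv_params(3, 3, 64), conv_params(3, 64, 64)]
block2 = [conv_params(3, 64, 128), conv_params(3, 128, 128)]

# Spatial size after each of the five 2x2 max-pooling stages
size = 192
sizes_after_pool = []
for _ in range(5):
    size //= 2
    sizes_after_pool.append(size)

# Flattened feature count fed to Dense(1024), then Dense(3)
flat_features = size * size * 512
head_params = (flat_features * 1024 + 1024) + (1024 * 3 + 3)
print(block1, block2, sizes_after_pool, flat_features, head_params)
```

Because the base model is frozen, the two dense layers of the head account for all trainable parameters.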

### Project 3: Predict Covid Tweets Misinformation

**Link to project:** https://github.com/chengzhong666/Advanced-Machine-Learning-Portfolio/blob/main/Project%203%20-%20Predict%20Covid%20Tweets%20Misinformation/covid%20misinformation.ipynb

**Research goal:** Classify COVID-19 tweets into two categories: 1) real and 2) false.

**Data source:** Shahi, Gautam Kishore, Anne Dirkson, and Tim A. Majchrzak. "An exploratory study of covid-19 misinformation on twitter." Online Social Networks and Media 22 (2021): 100104 ("https://raw.githubusercontent.com/diptamath/covid_fake_news/main/data/Constraint_Train.csv")

**Data type:** Text data

**Summary:**

This dataset contains the text of tweets about COVID-19. Each tweet is labeled with one of two categories, real or false. A predictive model built on these data is practically useful for assessing the truthfulness of information. It could help the public find accurate knowledge about COVID-19 more efficiently and prevent the spread of rumors.

The text data show different patterns for real and false tweets. For instance, truthful tweets generally have a neutral tone, use informative language, and avoid hateful speech. False tweets, on the other hand, tend to be inflammatory, reject scientific approaches to fighting the pandemic, and incite ignorance and hatred.

By applying deep learning algorithms to this dataset, I could analyze and identify these patterns of real and false tweets in a relatively automated way. The resulting models could be applied to future inputs, so that decision makers could predict future trends and regulations and optimize resources.

After preprocessing the text data, I experimented with four models:
- Model 1: 1 embedding layer + 2 LSTM layers (accuracy: 0.9411)
- Model 2: 1 embedding layer + 2 LSTM layers w/ dropout regularization on the second layer (accuracy: 0.9393)
- Model 3: 1 embedding layer + 1 conv 1D layer + 2 LSTM layers w/ dropout regularization on the second LSTM layer (accuracy: 0.9369)
- Model 4: 1 embedding layer + 1 bidirectional LSTM layer + 1 LSTM layer with dropout regularization (accuracy: 0.9472)

In [None]:
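from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Embedding, LSTM, Bidirectional

# Architecture of Model 4 above (the variable is named model3 in the notebook)
model3 = Sequential()
# 10,000-word vocabulary, 100-dimensional embeddings, sequences of 40 tokens
model3.add(Embedding(10000, 100, input_length=40))
# The bidirectional LSTM returns the full sequence so the next LSTM can consume it
model3.add(Bidirectional(LSTM(40, activation='tanh', return_sequences=True)))
# Second LSTM with input and recurrent dropout for regularization
model3.add(LSTM(60, dropout=0.2, recurrent_dropout=0.2, activation='tanh'))
model3.add(Dense(40, activation='relu'))
# Two-way softmax: real vs. false
model3.add(Dense(2, activation='softmax'))
model3.summary()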
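
The Embedding layer above expects each tweet as a sequence of 40 integer word indices, each below 10,000. The preprocessing that produces those sequences is not shown here, so the following is a plain-Python sketch under stated assumptions. The tokenization rule, the helpers `tokenize`, `build_vocab`, and `encode`, the reserved indices, the right-side padding, and the sample tweets are all hypothetical, and the Keras text utilities differ in detail.

```python
import re
from collections import Counter

VOCAB_SIZE = 10000  # matches Embedding(10000, ...)
MAX_LEN = 40        # matches input_length=40


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def build_vocab(texts, vocab_size=VOCAB_SIZE):
    """Map the most frequent words to indices 2..vocab_size-1."""
    counts = Counter(tok for t in texts for tok in tokenize(t))
    # Index 0 is reserved for padding and 1 for out-of-vocabulary words (assumed convention)
    most_common = counts.most_common(vocab_size - 2)
    return {word: i + 2 for i, (word, _) in enumerate(most_common)}


def encode(text, vocab, max_len=MAX_LEN):
    """Convert text to a fixed-length list of word indices, truncating or padding with 0."""
    ids = [vocab.get(tok, 1) for tok in tokenize(text)][:max_len]
    return ids + [0] * (max_len - len(ids))


# Made-up example tweets, not taken from the dataset
tweets = [
    "Wash your hands often to slow the spread",
    "Masks reduce transmission in crowded indoor spaces",
]
vocab = build_vocab(tweets)
encoded = [encode(t, vocab) for t in tweets]
print(encoded[0])
```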

Please feel free to reach out by email if you would like to talk about these projects. Thank you!