<img src="logos.jpg" width="700" />

# Tensorflow example using the Sina Weibo dataset

The purpose of this notebook is to illustrate the use of the __Tensorflow__ package through an example that builds a neural network and trains it with a supervised learning approach based on back-propagation. The example is based on a dataset of social media interactions collected from the Sina Weibo platform in 2014 and 2015.

Example main features:

1. All the code was implemented in Python 3.5 for Windows and Python 2.7 for Linux/Mac https://www.python.org/ 
2. The Python packages required to run the programs are the following:
    * Jupyter notebook (Python interactive prompt) http://jupyter.org/index.html
    * Tensorflow (Representation-classification) https://www.tensorflow.org/
    * Numpy (Numerical computing) http://www.numpy.org/
    * Matplotlib (visualization) https://matplotlib.org/
    * Pandas (Data analysis) https://pandas.pydata.org/
    * Xlrd (Data analysis) https://pypi.org/project/xlrd/
3. The Tensorflow version used is a standalone (CPU-only) installation, which means that it does not support parallel computations on GPUs.

    
## Step 1: Load the Sina Weibo dataset 


The dataset used in this notebook is an excerpt of the Sina Weibo platform (https://weibo.com/login.php) from September 2014 to May 2015. It contains 174,968 interactions from different users of the website, each with features such as year, city, gender, etc.

In [1]:
#Import the Pandas package, which helps us work with structured datasets easily
import pandas as pd
#For displaying a Pandas table in the Jupyter notebook
from IPython.display import display, Markdown
#Read the Microsoft Excel file
excel = pd.ExcelFile("2015_weibodata.xlsx")
#Load the dataset from the file as a Pandas table (also called a dataframe)
data= excel.parse(0) #Obtain the first sheet from the Excel file
#Show the first five rows of the table
data.head(5)

Unnamed: 0,`year`,`month`,`day`,`hour`,`minute`,`lon`,`lat`,`user_id`,`gender`,`province`,...,Unnamed: 21,Unnamed: 22,Unnamed: 23,Unnamed: 24,Unnamed: 25,Unnamed: 26,Unnamed: 27,Unnamed: 28,Unnamed: 29,Unnamed: 30
0,2014,9,1,15,52,121.57677,31.33588,2203784962,'f','31',...,,,,,,,,,,
1,2014,9,1,16,47,121.5495,31.3666,1006306595,'m','100',...,,,,,,,,,,
2,2014,9,1,17,7,121.569,31.3496,2027776417,'m','31',...,,,,,,,,,,
3,2014,9,1,19,52,121.56594,31.346882,3948063033,'m','100',...,,,,,,,,,,
4,2014,9,1,21,44,121.57879,31.34178,2275605075,'f','31',...,,,,,,,,,,


## Step 2: Preprocess the Sina Weibo dataset

In this step, the unnecessary columns are discarded. Additionally, the columns that have a string data type but are numeric in nature are cast to the appropriate type.


In [2]:
%%capture
#Jupyter notebook command that disables the Python output for this code box.
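#Eliminate the trailing columns that don't have any meaningful information.
#copy() gives an independent table, so the assignments below do not modify a view of the original data.
cleanData = data[data.columns[:-14]].copy()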
#Change the name of the columns
cleanData.columns = ["year","month","day","hour","minute","lon","lat","user_id","gender",
                     "province","city","statusesCount","followersCount","friendsCount", 
                     "repostsCount","commentsCount","text"]
#Transform province and city columns from string to integer.
cleanData["province"]=cleanData["province"].str[2:-1].apply(int)
cleanData["city"]=cleanData["city"].str[2:-1].apply(int)
#Transform lat and lon columns from string to float.
cleanData["lat"]=cleanData["lat"].apply(float)
cleanData["lon"]=cleanData["lon"].apply(float)


In [3]:
#Obtain the first five rows of the table
cleanData.head(5)

Unnamed: 0,year,month,day,hour,minute,lon,lat,user_id,gender,province,city,statusesCount,followersCount,friendsCount,repostsCount,commentsCount,text
0,2014,9,1,15,52,121.57677,31.33588,2203784962,'f',31,15,750,128,84,0,0,'#一周综艺看点#我分享了专题《一周综艺看点》，刘烨哭断片，谢霆锋容祖儿坎坷出道经历，快来...
1,2014,9,1,16,47,121.5495,31.3666,1006306595,'m',100,1000,1,0,0,0,0,'最近心情很糟糕、也不知道怎么了 http://t.cn/Rh2hXjy');
2,2014,9,1,17,7,121.569,31.3496,2027776417,'m',31,15,69,128,107,0,0,'#iphone6抢先送#看看你们这些爱慕虚荣的人，不就iphone6嘛，至于嘛？对于你们...
3,2014,9,1,19,52,121.56594,31.346882,3948063033,'m',100,1000,433,66,207,0,0,'人生几何总有些坎坷需要跨越，总有些责任需要担当，不断的跌倒，才有不变的顽强与收获:不变的...
4,2014,9,1,21,44,121.57879,31.34178,2275605075,'f',31,1000,155,87,119,0,3,'心碎 ?? ??只是一瞬间 http://t.cn/Rh2fXOB');
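
The slicing `.str[2:-1]` used in this step relies on every raw value having exactly one leading space and surrounding single quotes, which is the shape suggested by the `gender` counts shown later (` 'f'`). The following is a minimal sketch of a more tolerant alternative in plain Python, assuming the raw cells look like ` '31'`; the helper name `unquote` and the sample values are hypothetical and not part of the notebook.

```python
def unquote(raw):
    """Strip surrounding whitespace and single quotes from a raw cell value."""
    return raw.strip().strip("'")

# Hypothetical raw values in the assumed format (optional leading space plus quotes).
raw_province = [" '31'", " '100'", "'31'"]
province = [int(unquote(value)) for value in raw_province]
print(province)
```

On a Pandas column, the same idea could probably be expressed with `.str.strip().str.strip("'")` before the conversion to `int`.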


## Step 3: Select features from the Sina Weibo dataset

1. Select the __gender__ column as target label to be used in a classification process using a neural network.

In [4]:
import numpy as np            
labelCol='gender'
#Eliminate unnecessary characters from the gender column and store it as an array of strings.
#The built-in str is used as dtype because the np.str alias was removed in recent NumPy versions.
labels= np.array(cleanData.loc[:,labelCol].str[2:-1].apply(str), str) 
#Transform the string labels to integer values, which the neural network function requires.
labels[labels=='m'] = 0
labels[labels=='f'] = 1
labelsDatasetGender= np.array(labels, np.int32) 
#Show the number of rows associated to each gender.
cleanData['gender'].value_counts()

 'f'    109416
 'm'     65551
Name: gender, dtype: int64
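
The label encoding can also be written as an explicit mapping, which makes the 0/1 convention easy to audit. This is a minimal sketch in plain Python that uses a few made-up labels instead of the real column; `GENDER_TO_ID` and `encode` are illustrative names, not part of the notebook.

```python
GENDER_TO_ID = {"m": 0, "f": 1}  # same convention as the notebook: m=0, f=1

def encode(labels):
    """Map gender strings to integer class ids; unknown labels raise KeyError."""
    return [GENDER_TO_ID[label] for label in labels]

sample = ["f", "m", "m", "f", "f"]  # made-up labels, not taken from the dataset
encoded = encode(sample)
print(encoded)

# Share of the 'f' class according to the counts printed by value_counts().
female_share = 109416 / (109416 + 65551)
print(round(female_share, 3))
```

The counts show that the two classes are not balanced, which is worth keeping in mind when reading the accuracy of the classifier.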

2\. Considering the different columns available in the dataset, select a subset of numeric columns as features for a neural network trained with back-propagation. Note that the non-numeric features are discarded.

In [5]:
#According to the nature and type of the attributes, select the most representative ones as features for the classification process
FeatureCols = ["year","day","hour","minute","lon","lat","province","city","statusesCount",
               "followersCount","friendsCount", "repostsCount","commentsCount"]
cleanData=cleanData.loc[:, FeatureCols]
#Store the selected features in a new NumPy array.
#Note that the int32 cast truncates the decimal part of lon and lat.
featuresDataset = np.array(cleanData, np.int32)


3\. Split the rows into training and test sets for the classification process. Use the first one hundred and fifty thousand rows for training and the remaining ones for testing.

In [6]:
#Split the dataset: the first 150,000 rows (about 86%) for training and the remaining rows (about 14%) for test.
training = featuresDataset[:150000]
test = featuresDataset[150000:]
trainingLabels=labelsDatasetGender[:150000]
testLabels=labelsDatasetGender[150000:]
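
To check the proportions that this split produces, the slicing can be reproduced on a stand-in list of row indices. This is a minimal sketch; the total of 174,967 rows is an assumption taken from the sum of the two gender counts, and `head_tail_split` is a hypothetical helper.

```python
def head_tail_split(rows, n_train):
    """Return (train, test) where train holds the first n_train rows."""
    return rows[:n_train], rows[n_train:]

# Stand-in for the dataset: row indices only (assumed total: 109,416 + 65,551).
rows = list(range(174967))
train_rows, test_rows = head_tail_split(rows, 150000)

train_fraction = len(train_rows) / len(rows)
print(len(train_rows), len(test_rows), round(train_fraction, 3))
```

Because the split follows the order of the rows in the file, the test subset contains the last interactions of the table rather than a random sample.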

## Step 4: Create a neural network to predict the __gender__ of Sina Weibo users

Create a neural network with an input layer of thirteen neurons (the number of columns selected), three hidden layers of three hundred neurons each, and an output layer for the two __gender__ classes. The network is created with a predefined estimator called DNNClassifier, which by default is trained with the Adagrad optimizer (a gradient descent variant) and a cross-entropy loss.
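
To get a feeling for the size of this architecture, the number of trainable parameters of a fully connected network can be counted layer by layer. This is a rough sketch in plain Python under the assumption of one weight per connection and one bias per neuron, with two output units; the exact count inside DNNClassifier may differ depending on how the estimator builds its output layer. The function name `dense_param_count` is illustrative.

```python
def dense_param_count(layer_sizes):
    """Count the weights and biases of a fully connected network."""
    total = 0
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        total += n_in * n_out  # weight matrix between two consecutive layers
        total += n_out         # one bias per neuron of the receiving layer
    return total

# 13 inputs, three hidden layers of 300 neurons, 2 outputs (one per gender class).
layers = [13, 300, 300, 300, 2]
param_count = dense_param_count(layers)
print(param_count)
```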

<img src="backPropagation3.png" alt="Drawing" style="width: 600px;"/>


In [7]:
import tensorflow as tf
#Store the number of features (thirteen) to be used in the neural network.
#https://www.tensorflow.org/api_docs/python/tf/feature_column/numeric_column
NNFeatureCols  = [tf.feature_column.numeric_column("X", shape=[13])]
"""
Create a neural network specifying the number of neurons per hidden layer, the number of target labels (f=1 or m=0) and 
the number of features per row in the table.
"""
#https://www.tensorflow.org/api_docs/python/tf/estimator/DNNClassifier
dnnClf = tf.estimator.DNNClassifier(hidden_units=[300,300,300], n_classes=2,feature_columns=NNFeatureCols)
#Feed the training data, as well as the number of epochs and the batch size, into the model.
#https://www.tensorflow.org/api_docs/python/tf/estimator/inputs/numpy_input_fn
#One epoch is when the ENTIRE dataset is passed forward and backward through the neural network ONCE
#Since one epoch is too big to feed to the computer at once, we divide it into several smaller batches
input_fn = tf.estimator.inputs.numpy_input_fn(
    x={"X": training}, y=trainingLabels, num_epochs=40, batch_size=500, shuffle=True)
#Train the neural network
dnnClf.train(input_fn=input_fn)



INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_keep_checkpoint_every_n_hours': 10000, '_task_type': 'worker', '_is_chief': True, '_keep_checkpoint_max': 5, '_task_id': 0, '_tf_random_seed': None, '_model_dir': 'C:\\Users\\ESTEBA~1\\AppData\\Local\\Temp\\tmpvxgkp6l_', '_save_checkpoints_secs': 600, '_session_config': None, '_master': '', '_save_summary_steps': 100, '_service': None, '_save_checkpoints_steps': None, '_train_distribute': None, '_log_step_count_steps': 100, '_num_ps_replicas': 0, '_evaluation_master': '', '_global_id_in_cluster': 0, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x000002A53F164F60>, '_num_worker_replicas': 1}
INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Saving checkpoints for 1 into C:\Users\ESTEBA

INFO:tensorflow:loss = 295.93842, step = 7201 (2.249 sec)
INFO:tensorflow:global_step/sec: 52.3908
INFO:tensorflow:loss = 304.83582, step = 7301 (1.907 sec)
INFO:tensorflow:global_step/sec: 58.4643
INFO:tensorflow:loss = 278.3709, step = 7401 (1.708 sec)
INFO:tensorflow:global_step/sec: 59.0404
INFO:tensorflow:loss = 315.68298, step = 7501 (1.696 sec)
INFO:tensorflow:global_step/sec: 56.8494
INFO:tensorflow:loss = 303.99515, step = 7601 (1.757 sec)
INFO:tensorflow:global_step/sec: 54.6962
INFO:tensorflow:loss = 280.74194, step = 7701 (1.829 sec)
INFO:tensorflow:global_step/sec: 49.5464
INFO:tensorflow:loss = 303.0869, step = 7801 (2.017 sec)
INFO:tensorflow:global_step/sec: 56.5193
INFO:tensorflow:loss = 294.41406, step = 7901 (1.770 sec)
INFO:tensorflow:global_step/sec: 53.2193
INFO:tensorflow:loss = 287.41727, step = 8001 (1.928 sec)
INFO:tensorflow:global_step/sec: 53.6074
INFO:tensorflow:loss = 293.353, step = 8101 (1.816 sec)
INFO:tensorflow:global_step/sec: 58.5151
INFO:tensorflo

<tensorflow.python.estimator.canned.dnn.DNNClassifier at 0x2a53bf92dd8>
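
The relation between rows, batch size, epochs and training steps can be verified with a little arithmetic. This is a minimal sketch using the values of the training cell (150,000 rows, `batch_size=500`, `num_epochs=40`); the helper `steps_per_epoch` is a hypothetical name.

```python
import math

def steps_per_epoch(n_rows, batch_size):
    """Number of batches needed to go through all the rows once."""
    return math.ceil(n_rows / batch_size)

n_rows, batch_size, num_epochs = 150000, 500, 40
per_epoch = steps_per_epoch(n_rows, batch_size)
total_steps = per_epoch * num_epochs
print(per_epoch, total_steps)
```

The total agrees with the checkpoint name `model.ckpt-12000` that appears in the evaluation log of the next step.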

## Step 5: Test the neural network

1. Evaluate the proposed neural network model on the complete Sina Weibo test dataset

In [8]:
#https://www.tensorflow.org/api_docs/python/tf/estimator/inputs/numpy_input_fn
#Use the trained model to predict the gender of the social media interactions in the test dataset
#Feed the test data into the model
testFunction = tf.estimator.inputs.numpy_input_fn(
    x={"X": test}, y=testLabels, shuffle=False)
#Evaluate the results according to accuracy, precision, recall among other metrics
eval_results = dnnClf.evaluate(input_fn=testFunction)
print(eval_results)

INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Starting evaluation at 2018-07-09-01:10:13
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from C:\Users\ESTEBA~1\AppData\Local\Temp\tmpvxgkp6l_\model.ckpt-12000
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Finished evaluation at 2018-07-09-01:10:15
INFO:tensorflow:Saving dict for global step 12000: accuracy = 0.6591901, accuracy_baseline = 0.66111267, auc = 0.61261624, auc_precision_recall = 0.73273844, average_loss = 0.6338437, global_step = 12000, label/mean = 0.66111267, loss = 80.7407, precision = 0.69651544, prediction/mean = 0.6208308, recall = 0.85859686
{'average_loss': 0.6338437, 'label/mean': 0.66111267, 'auc': 0.61261624, 'accuracy': 0.6591901, 'prediction/mean': 0.6208308, 'accuracy_baseline': 0.66111267, 'global_step': 12000, 'auc_precision_recall': 0.73273844, 'loss': 80.7407, 'precision': 0.69651544, 're
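
In the output above, `accuracy_baseline` has the same value as `label/mean` (0.66111267), which corresponds to the accuracy of always predicting the most frequent class; the reported `accuracy` (0.6591901) does not exceed it. The sketch below shows how these metrics are usually computed for binary labels. It uses made-up labels and predictions, not the notebook data, and `binary_metrics` is a hypothetical helper rather than a Tensorflow function.

```python
def binary_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall and the majority-class baseline."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    positive_rate = sum(y_true) / len(y_true)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        # Accuracy of a model that always predicts the most frequent class.
        "accuracy_baseline": max(positive_rate, 1 - positive_rate),
    }

# Made-up labels and predictions (1 = f, 0 = m).
y_true = [1, 1, 1, 0, 0, 1, 0, 1]
y_pred = [1, 1, 0, 1, 0, 1, 1, 1]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

A model is only informative when its accuracy is clearly above the baseline, so both values should be read together.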

2\. Show the prediction for a specific test sample

In [9]:
#predict() returns an iterable object that yields the results one sample at a time
IterPred= dnnClf.predict(input_fn=testFunction)
#Collect all the predictions in a list and print the first one
y_pred = list(IterPred)
print(y_pred[0])

INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from C:\Users\ESTEBA~1\AppData\Local\Temp\tmpvxgkp6l_\model.ckpt-12000
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
{'class_ids': array([1], dtype=int64), 'probabilities': array([0.2408241, 0.7591759], dtype=float32), 'logistic': array([0.7591759], dtype=float32), 'logits': array([1.1481667], dtype=float32), 'classes': array([b'1'], dtype=object)}
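
The fields of this prediction are related to each other: the single value in `logits` appears to be mapped to `logistic` by the sigmoid function, and `probabilities` holds that value together with its complement. A minimal sketch that reproduces the printed numbers in plain Python, assuming this reading of the output is correct:

```python
import math

def sigmoid(logit):
    """Map a real-valued logit to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-logit))

logit = 1.1481667  # value of 'logits' in the prediction printed above
p_class1 = sigmoid(logit)                   # probability of class 1 (f)
probabilities = [1.0 - p_class1, p_class1]  # [P(class 0), P(class 1)]
class_id = 1 if p_class1 >= 0.5 else 0      # most likely class
print(probabilities, class_id)
```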


## Conclusions

- What we have seen so far:
    1. Loading a specific dataset from an Excel file.
    2. Preprocessing the information in the dataset.
    3. Creation of a neural network using a __reduced version of the Tensorflow package__.
        * Input layer with thirteen neurons.
        * Three hidden layers of three hundred neurons each.
        * One output layer that distinguishes the two possible genders in the dataset (0=m and 1=f).
    4. Use of the DNNClassifier function, which encapsulates all the complexity associated with the creation of a neural network.
    5. Creation of training and test subsets from the Sina Weibo data.
    6. Creation of a model using the training subset provided.
    7. Evaluation of the test subset using the model created.
- The results obtained highlight the following findings:
    1. Tensorflow allows us to create a graph where nodes are operations and edges are the data that flows from one operation to another.
    2. There are multiple predefined functions in Tensorflow that help to create a neural network easily.
    3. If the computations are intensive, the processing can be distributed among different computers or GPUs using the Tensorflow API.
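
The idea of a graph whose nodes are operations and whose edges carry data can be illustrated with a toy example. The sketch below is plain Python and only mimics the concept; it is not how Tensorflow is implemented, and the names `graph` and `evaluate` are hypothetical.

```python
import operator

# Each node is either a constant or a pair (operation, names of the input nodes).
graph = {
    "x": 3.0,
    "w": 2.0,
    "b": 1.0,
    "wx": (operator.mul, ("w", "x")),  # wx = w * x
    "y": (operator.add, ("wx", "b")),  # y = wx + b
}

def evaluate(graph, name):
    """Evaluate a node by first evaluating the nodes it depends on."""
    node = graph[name]
    if not isinstance(node, tuple):
        return node  # constant: nothing to compute
    op, inputs = node
    return op(*(evaluate(graph, dep) for dep in inputs))

result = evaluate(graph, "y")
print(result)
```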