# Team 5A09 DMA Course: Project Sina Weibo Interaction Prediction Challenge
![](weibo.jpg)
## Determining Statistical Factors
### Authors: Apoorva Malemath, Arundati Dixit, Ashish Kar, Deepti Nadkarni

In [1]:
import import_ipynb
import pandas as pd
from genUidStat import loadData,genUidStat
from evaluation import precision
from runTime import runTime

importing Jupyter notebook from genUidStat.ipynb
importing Jupyter notebook from evaluation.ipynb
importing Jupyter notebook from runTime.ipynb


# Information on Loaded Modules

## genUidStat.ipynb
Loads the train and predict datasets and generates per-UID statistics (min, max, median, mean) for further analysis.
## evaluation.ipynb
Evaluation function according to the official rule:
http://tianchi.aliyun.com/competition/information.htm?spm=5176.100067.5678.2.Grh4pl&raceId=5

## runTime.ipynb
A basic decorator that reports the run time of the function it wraps.

# Prerequisites

## 1. Generate UID Stats with statistical measures for FCL

We compute the mean, median, max and min of the forward, comment and like counts (FCL) for every unique UID in the train dataset. These per-UID statistics are the basis of the analysis that follows.
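
The source of `genUidStat` is not reproduced in this notebook, so the following is only a minimal sketch of how such per-UID statistics could be computed. It assumes each training record is a `(u_id, forward_count, comment_count, like_count)` tuple and returns a dict keyed by UID with entries such as `forward_mean`, mirroring how `stat_dic` is indexed later on. The name `uid_stats_sketch` and the sample records are hypothetical.

```python
from collections import defaultdict
from statistics import mean, median


def uid_stats_sketch(records):
    """Return {u_id: {"forward_min": ..., ..., "like_mean": ...}} (hypothetical helper)."""
    # Collect the three count lists for every uid.
    per_uid = defaultdict(lambda: {"forward": [], "comment": [], "like": []})
    for u_id, forward, comment, like in records:
        per_uid[u_id]["forward"].append(forward)
        per_uid[u_id]["comment"].append(comment)
        per_uid[u_id]["like"].append(like)

    # Reduce each list to the four statistics used in this notebook.
    stats = {}
    for u_id, counts in per_uid.items():
        row = {}
        for name, values in counts.items():
            row[name + "_min"] = min(values)
            row[name + "_max"] = max(values)
            row[name + "_median"] = median(values)
            row[name + "_mean"] = mean(values)
        stats[u_id] = row
    return stats


# Made-up user "x" with two posts in the training set.
records = [("x", 2, 0, 4), ("x", 4, 2, 0)]
stats = uid_stats_sketch(records)
print(stats["x"])
```

Every post of the same user in the predict dataset would then receive the same predicted FCL, for example the three `*_mean` values.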

In [1]:
import pandas as pd  # imported here so the cell also runs on its own
df = pd.read_csv("train_uid_stat.csv")

### Example for UID Stats
#### Say that in the train dataset UID x has two MIDs (i.e. two posts):
###### Train Dataset: 
![](u1.png)
###### UID Stats:
![](u2.png)
#### Now suppose the same user has 4 MIDs in the predict dataset. Predicting FCL with the factor "mean" then gives:
###### Predict Dataset
![](u3.png)
#### Similarly, with the factor "max":
###### Predict Dataset
![](u4.png)

In [10]:
df.head(50)

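| Unnamed: 0 | u_id | forward_min | forward_max | forward_median | forward_mean | comment_min | comment_max | comment_median | comment_mean | like_min | like_max | like_median | like_mean |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 000127c6126e2b0019f255ed21ac1cb7 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0001565a5edece1669577e2ace9a6a3d | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 00033a6513b86b2705de9ffa9d37ffb6 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 3 | 0004fe2742507420eaa73e119dc83ac5 | 0 | 6 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 |
| 4 | 000c663a24a2f91f4ba156fcd4f8b9f2 | 0 | 1 | 0 | 0 | 0 | 7 | 0 | 0 | 0 | 6 | 0 | 0 |
| 5 | 000ce19d2fccb1f22421bec50bf25b08 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 6 | 000d7bf7406392b2212dfb4fe907d946 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 7 | 0012edb614365800e901c7f2b47e9129 | 0 | 0 | 0 | 0 | 0 | 4 | 0 | 1 | 0 | 0 | 0 | 0 |
| 8 | 001349a053bdecf1a71960f29288ced1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 |
| 9 | 0015c42ec93854687a258a7f170c6acf | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |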


## 2. Use the Official Formula to Determine Accuracy for the Statistical Factors

![](formula.jpg)

![](pyformula.png)
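
The formula is only available here as an image, so the sketch below is a reconstruction under stated assumptions rather than the official code. The offsets 5, 3 and 3, the weights 0.5, 0.25 and 0.25, the 0.8 threshold and the cap of 100 interactions per post are recalled from the published competition rule and should be checked against the image. `precision_sketch` is a hypothetical name; it is not the `precision` function imported from `evaluation.ipynb`. Each row is assumed to hold the real counts followed by the predicted counts.

```python
def precision_sketch(rows):
    """rows: iterable of (fr, cr, lr, fp, cp, lp), real counts then predicted counts."""
    hit_weight = 0.0
    total_weight = 0.0
    for fr, cr, lr, fp, cp, lp in rows:
        # Relative deviation of each prediction (offsets are assumed constants).
        dev_f = abs(fp - fr) / (fr + 5.0)
        dev_c = abs(cp - cr) / (cr + 3.0)
        dev_l = abs(lp - lr) / (lr + 3.0)
        per_post = 1.0 - 0.5 * dev_f - 0.25 * dev_c - 0.25 * dev_l

        # Posts with more interactions weigh more, capped at 100 (assumed).
        weight = min(fr + cr + lr, 100) + 1
        total_weight += weight
        if per_post > 0.8:  # a post counts only if its precision exceeds 0.8
            hit_weight += weight
    return hit_weight / total_weight


# Made-up rows: one all-zero post predicted exactly, one 10-forward post predicted as 0.
rows = [(0, 0, 0, 0, 0, 0), (10, 0, 0, 0, 0, 0)]
print(precision_sketch(rows))
```

Under these assumed constants, a (0,1,1) guess on an all-zero post still clears the 0.8 threshold, while a (1,1,1) guess does not.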

# Predict with fixed Value

##  1. Default Values

About 80% of the training posts have the counts 0 0 0 (forward_count, comment_count, like_count), and 96% of the uids in the predict dataset also appear in the train dataset. The remaining 4% are new uids, for which we need default values.
Inspired by this, we first try a single fixed value for all uids.
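
The two shares quoted above could be checked along the following lines. This is a sketch on made-up data, not on the competition files; the helper names `zero_share` and `uid_coverage` are hypothetical, and it assumes the 96% figure refers to distinct uids rather than to rows.

```python
def zero_share(fcl_rows):
    """Fraction of posts whose (forward, comment, like) counts are all zero."""
    fcl_rows = [tuple(row) for row in fcl_rows]
    return sum(1 for row in fcl_rows if row == (0, 0, 0)) / len(fcl_rows)


def uid_coverage(train_uids, predict_uids):
    """Fraction of distinct predict-set uids that also occur in the train set."""
    known = set(train_uids)
    distinct = set(predict_uids)
    return len(distinct & known) / len(distinct)


# Made-up inputs for illustration only.
train_fcl = [(0, 0, 0), (0, 0, 0), (1, 0, 2), (0, 0, 0)]
share = zero_share(train_fcl)
coverage = uid_coverage(["a", "b", "c"], ["a", "b", "d", "a"])
print(share, coverage)
```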

### Function to take Fixed FCL Values, Give Accuracy and Generate Predicted FCL

In [2]:
@runTime
def predict_with_fixed_value(forward,comment,like,submission=True):
	# type check: all three predicted counts must be integers
	if not all(isinstance(v,int) for v in (forward,comment,like)):
		raise TypeError("forward,comment,like should be type 'int' ")
	
	traindata,testdata = loadData()
	
	#score on the training set: real F,C,L columns followed by the predicted ones
	train_real_pred = traindata[['forward_count','comment_count','like_count']]
	train_real_pred['fp'],train_real_pred['cp'],train_real_pred['lp'] = forward,comment,like
	print ("Score on the training set:{0:.2f}%".format(precision(train_real_pred.values)*100))
	
	#predict on the test data with fixed value, generate submission file
	if submission:
		test_pred = testdata[['u_id','m_id']]
		test_pred['fp'],test_pred['cp'],test_pred['lp'] = forward,comment,like
		
		#one line per post: uid<TAB>mid<TAB>forward,comment,like
		filename = "weibo_predict_{}_{}_{}.txt".format(forward,comment,like)
		with open(filename,'w') as f:
			for row in test_pred.itertuples(index=False):
				f.write("{0}\t{1}\t{2},{3},{4}\n".format(*row))
		print ('generate submission file "{}"'.format(filename))

## 2. UID Statistics (Mean, Max, Min, Median)

A better approach is to predict each uid's posts with that uid's own statistics (e.g. mean or median).
The score of each statistic on the training data is reported below.

### Function to take Statistical Factor, Give Accuracy and Generate Predicted FCL

In [3]:
@runTime
def predict_with_stat(stat="median",submission=True):
	"""
	stat:
		string, one of min,max,mean,median
	"""
	if stat not in ("min","max","mean","median"):
		raise ValueError("stat should be one of min,max,mean,median")
	stat_dic = genUidStat()
	traindata,testdata = loadData()
	
	def lookup(uids):
		#predicted F,C,L for each uid; uids without statistics default to 0
		forward,comment,like = [],[],[]
		for uid in uids:
			if uid in stat_dic:
				forward.append(int(stat_dic[uid]["forward_"+stat]))
				comment.append(int(stat_dic[uid]["comment_"+stat]))
				like.append(int(stat_dic[uid]["like_"+stat]))
			else:
				forward.append(0)
				comment.append(0)
				like.append(0)
		return forward,comment,like
	
	#score on the training set
	train_real_pred = traindata[['forward_count','comment_count','like_count']]
	train_real_pred['fp'],train_real_pred['cp'],train_real_pred['lp'] = lookup(traindata['u_id'])
	print ("Score on the training set:{0:.2f}%".format(precision(train_real_pred.values)*100))
	
	#predict on the test data with the chosen statistic, generate submission file
	if submission:
		test_pred = testdata[['u_id','m_id']]
		test_pred['fp'],test_pred['cp'],test_pred['lp'] = lookup(testdata['u_id'])
		
		#one line per post: uid<TAB>mid<TAB>forward,comment,like
		filename = "weibo_predict_{}.txt".format(stat)
		with open(filename,'w') as f:
			for row in test_pred.itertuples(index=False):
				f.write("{0}\t{1}\t{2},{3},{4}\n".format(*row))
		print ('generate submission file "{}"'.format(filename))
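
The per-uid lookup above walks the rows in a Python loop. One possible alternative is a vectorised lookup with `Series.map`, sketched below on made-up data. It assumes `stat_dic` has the shape used above (a dict of per-uid dicts with keys like `forward_median`); the helper name `lookup_stat` is hypothetical and is not part of the imported modules.

```python
import pandas as pd


def lookup_stat(uids, stat_dic, stat="median", default=0):
    """Map each uid to its (fp, cp, lp) statistic; unseen uids get `default`."""
    out = pd.DataFrame(index=uids.index)
    for name, col in (("forward", "fp"), ("comment", "cp"), ("like", "lp")):
        # Flatten the nested dict to {uid: value} for the requested statistic.
        per_uid = {uid: int(s[name + "_" + stat]) for uid, s in stat_dic.items()}
        # map() yields NaN for uids missing from per_uid; fill those with the default.
        out[col] = uids.map(per_uid).fillna(default).astype(int)
    return out


# Made-up inputs: uid "b" has no statistics and falls back to 0.
stat_dic = {"a": {"forward_median": 2, "comment_median": 0, "like_median": 5}}
uids = pd.Series(["a", "b", "a"], name="u_id")
res = lookup_stat(uids, stat_dic)
print(res)
```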

# Checking the Accuracy of the Various Statistical Factors

In [27]:
if __name__ == "__main__":
		predict_with_stat(stat="median",submission=True)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy


Score on the training set:32.73%


generate submission file "weibo_predict_median.txt"
predict_with_stat run time: 135.31s
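
The message "A value is trying to be set on a copy of a slice from a DataFrame" in the output above is the pandas SettingWithCopyWarning. It appears because the prediction columns are assigned to a column subset of the loaded DataFrame. A minimal sketch of one common way to avoid it is to take an explicit `.copy()` of the subset first; the toy `traindata` below is a made-up stand-in for the frame returned by `loadData()`.

```python
import pandas as pd

# Made-up stand-in for the training data.
traindata = pd.DataFrame({
    "u_id": ["a", "b"],
    "forward_count": [0, 3],
    "comment_count": [1, 0],
    "like_count": [0, 2],
})

# .copy() makes the column subset an independent DataFrame, so adding the
# prediction columns is unambiguous and leaves traindata untouched.
train_real_pred = traindata[["forward_count", "comment_count", "like_count"]].copy()
train_real_pred["fp"], train_real_pred["cp"], train_real_pred["lp"] = 0, 1, 1

print(list(train_real_pred.columns))
print(list(traindata.columns))
```

The scores do not depend on this change, since the assigned values are the same either way.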


![](median.png)

In [29]:
if __name__ == "__main__":
		predict_with_fixed_value(0,1,1,submission=True)

Score on the training set:26.43%


generate submission file "weibo_predict_0_1_1.txt"
predict_with_fixed_value run time: 68.95s


![](011.png)

In [4]:
if __name__ == "__main__":
		predict_with_stat(stat="mean",submission=True)

Score on the training set:30.17%


generate submission file "weibo_predict_mean.txt"
predict_with_stat run time: 132.35s


![](mean.png)

In [5]:
if __name__ == "__main__":
		predict_with_stat(stat="max",submission=True)

Score on the training set:7.13%


generate submission file "weibo_predict_max.txt"
predict_with_stat run time: 132.56s


![](max.png)

In [6]:
if __name__ == "__main__":
		predict_with_stat(stat="min",submission=True)

Score on the training set:26.07%


generate submission file "weibo_predict_min.txt"
predict_with_stat run time: 131.45s


![](min.png)

In [7]:
if __name__ == "__main__":
		predict_with_fixed_value(0,0,0,submission=True)

Score on the training set:25.98%


generate submission file "weibo_predict_0_0_0.txt"
predict_with_fixed_value run time: 72.05s


In [4]:
if __name__ == "__main__":
		predict_with_fixed_value(0,0,1,submission=True)

Score on the training set:26.11%


generate submission file "weibo_predict_0_0_1.txt"
predict_with_fixed_value run time: 66.76s


In [5]:
if __name__ == "__main__":
		predict_with_fixed_value(0,1,0,submission=True)

Score on the training set:25.95%


generate submission file "weibo_predict_0_1_0.txt"
predict_with_fixed_value run time: 68.40s


In [6]:
if __name__ == "__main__":
		predict_with_fixed_value(1,0,0,submission=True)

Score on the training set:22.22%


generate submission file "weibo_predict_1_0_0.txt"
predict_with_fixed_value run time: 67.65s


In [7]:
if __name__ == "__main__":
		predict_with_fixed_value(1,0,1,submission=True)

Score on the training set:23.44%


generate submission file "weibo_predict_1_0_1.txt"
predict_with_fixed_value run time: 69.00s


In [8]:
if __name__ == "__main__":
		predict_with_fixed_value(1,1,0,submission=True)

Score on the training set:21.28%


generate submission file "weibo_predict_1_1_0.txt"
predict_with_fixed_value run time: 71.24s


In [9]:
if __name__ == "__main__":
		predict_with_fixed_value(1,1,1,submission=True)

Score on the training set:10.18%


generate submission file "weibo_predict_1_1_1.txt"
predict_with_fixed_value run time: 69.53s


## Overall Results

![](overall1.png)
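
For readers without access to the image, the training-set scores printed by the runs above can be collected and ranked with a few lines of Python. The labels are chosen here for illustration; the numbers are copied from the recorded outputs.

```python
# Training-set scores (%) copied from the run outputs in this notebook.
scores = {
    "stat median": 32.73,
    "stat mean": 30.17,
    "stat min": 26.07,
    "stat max": 7.13,
    "fixed 0,1,1": 26.43,
    "fixed 0,0,1": 26.11,
    "fixed 0,0,0": 25.98,
    "fixed 0,1,0": 25.95,
    "fixed 1,0,1": 23.44,
    "fixed 1,0,0": 22.22,
    "fixed 1,1,0": 21.28,
    "fixed 1,1,1": 10.18,
}

# Print the factors from best to worst training score.
for name, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print("{:<14}{:>6.2f}%".format(name, score))

best = max(scores, key=scores.get)
worst = min(scores, key=scores.get)
```

On the training set the per-uid median scores highest (32.73%) and the per-uid max lowest (7.13%).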

# Current Sina Weibo Interaction Prediction Leaderboard by Aliyun.com

![](leaderboard.jpg)

# References

https://github.com/wepe/AliTianChi/tree/master  (A Statistical Analysis on Weibo Sina Interaction Prediction 2014 Challenge)