# Key Elements of Psychometric Profiling
This is a Jupyter notebook that reimplements a few key elements of [Michal Kosinski's R tutorial](https://www.michalkosinski.com/data-mining-tutorial) for a workshop on Psychometric Profiling.

You are going to
0. Learn how to use a Jupyter notebook
1. Have a look at the original user and likes data
2. Transform the data into a format that is better suited for processing
3. Clean the data
4. Condense the data by reducing its dimensionality
5. Train a model that can predict a user's personality by xer likes
6. Evaluate the model

Please, take some notes along the way about what strikes you most in this approach to Psychometric Profiling and what you would like to share with the other participants of the workshop.

Especially if you have little or no prior experience with programming, some of the following advice may be useful:
- You are highly encouraged to try [pair programming](https://en.wikipedia.org/wiki/Pair_programming), a concept where two people program together. This enables rapid skill sharing, creativity and reduces coding errors. Please try to pair up, such that persons without coding experience always have a more experienced partner. Don't let one person write all the code and the other observe all the time, but take turns and discuss what you are doing!
- Don't be afraid to try things out, you won't break anything.
- If you are stuck, check the documentation, ask another person or consult your favorite search engine.

## A Jupyter Notebook

You are working in a [Jupyter notebook](https://jupyter-notebook.readthedocs.io/en/stable/), which allows you to write and run code step by step and interactively in the browser. The underlying programming language is called [Python](https://www.python.org/) and is available for all major operating systems.

You can select a cell to write code. When you are done, you can run the code by clicking on the play arrow at the top of this window or by pressing `Ctrl-Return`.

Try it out with the code cell below. You can try some of the following
- Math: `3 * 2 + 1`
- Text: `print("Hello," + " world!")`
- Help: `help(print)`

In [1]:
print("Hello," + " world!")

Hello, world!


# Loading the data

The data used here is part of [Michal Kosinski's tutorial](https://www.michalkosinski.com/data-mining-tutorial). It consists of three tables, saved in so-called [CSV files](https://en.wikipedia.org/wiki/Comma-separated_values). The tables are:
- user data including their ID, age, gender, political view and psychological profile according to an OCEAN/BigFive personality test
- likes data which is an ID and the name of the like
- users_likes data, which are user-ID and like-ID combinations, to see which user liked what

Useful Python code can be loaded from so-called libraries, which contain functions for specific purposes. A prominent library for data analysis is called [Pandas](http://pandas.pydata.org/pandas-docs/stable/). The following code imports the function `read_csv` to read the data from [CSV files](https://en.wikipedia.org/wiki/Comma-separated_values). To call a function, you write `function_name(arguments)`; here, `arguments` is just the location of the specific CSV file.

Pandas structures tabular data in [DataFrames](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html), which supply many useful methods in dot notation. One of them is the `head` method, which can be called without any arguments to show the first five rows of a DataFrame. Run the cell below to see the structure of the `users` table.

In [3]:
from pandas import read_csv
users = read_csv('data/users.csv')
users_likes = read_csv('data/users-likes.csv')
likes = read_csv('data/likes.csv')
print(users.head())

                             userid  gender  age  political   ope   con   ext  \
0  54f34605aebd63f7680e37ffd299af79       0   33        0.0  1.26  1.65  1.17   
1  86399f8c44ba54224b2e60177ca89fa9       1   35        0.0  1.07  0.17 -0.14   
2  84fab50f3c60d1fdc83aa91b5e584a78       1   36        0.0  0.89  1.28  0.86   
3  f3b8fdaccce12ef6352bfad4d6052fe9       0   39        NaN  0.33 -1.01 -0.33   
4  8b06ea5e9cb87c61da387995450607f7       0   31        NaN  0.15  0.47  1.17   

    agr   neu  
0 -1.76  0.61  
1  1.49  0.30  
2  1.07  0.99  
3 -0.68  0.92  
4 -1.01 -0.32  


Did you see anything special in the data so far? For example the column `gender` is encoded as 0 or 1. Maybe you also wonder about the `NaN` in the `political` column? This is an abbreviation for "Not a number" and denotes a missing value, i.e. the user's political orientation is unknown. Otherwise political orientation is also encoded as 0 or 1 (0.0 is the floating point notation for the integer 0). The last five columns stand for the OCEAN profile of the given user.

You can try out the `head` method on the other data. Or you might wonder how to show more than 5 rows. Try `help(users.head)`. Or you might wonder what other methods can be called on a `DataFrame`. You can see this simply with `help(users)`. For example the `describe` method can also be interesting for a first overview on unknown data. 

Another useful hint: You can use the `TAB` key for auto-completion, so writing `users.he` and pressing `TAB` should automatically write `head`.
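
The methods mentioned above can be tried on any DataFrame. The following is a minimal sketch on a made-up toy table (the values and the name `toy_users` are invented for illustration and are not taken from `users.csv`); it shows `head` with an argument, `describe`, and one way to count the missing values per column.

```python
import numpy as np
import pandas as pd

# Made-up toy table that reuses a few column names of users.csv
toy_users = pd.DataFrame({
    "userid": ["a", "b", "c", "d"],
    "gender": [0, 1, 1, 0],
    "age": [33, 35, 36, 39],
    "political": [0.0, 1.0, np.nan, np.nan],
    "ope": [1.26, 1.07, 0.89, 0.33],
})

print(toy_users.head(2))          # first two rows instead of the default five
print(toy_users.describe())       # count, mean, std, min, quartiles and max per numeric column
missing = toy_users.isna().sum()  # number of missing values (NaN) per column
print(missing)
```

The same calls should work on `users`, `likes` and `users_likes` once they are loaded.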

# Data Transformation

Up to now, the data is split across three separate tables. The tools used for Psychometric Profiling need the data in a single structure called a sparse matrix. It has the following form, where a 1 in a cell means that the user of that row liked the like of that column, and a `.` denotes an empty cell, meaning that the given user did not like the given like:

|  .  | LikeA | LikeB | LikeC | ... |
|-----|-------|-------|-------|-----|
|UserA|   .   |   1   |   .   |  .  |
|UserB|   1   |   .   |   .   |  .  |
|UserC|   1   |   .   |   1   |  .  |
| ... |   .   |   .   |   .   |  .  |

Merging the tables into one table is possible with Pandas' `merge` function. Once that is done, we can construct a sparse matrix from it. The sparse matrix implementation comes from the library [scipy](https://docs.scipy.org/doc/scipy/reference/sparse.html), which contains many functions for scientific programming.

Reformatting data to fit the required format can be tedious, and you need to know the right functions. It is usually not very enlightening, though, so you do not have to spend too much time on the following piece of code.

In [4]:
from pandas import merge
from scipy.sparse import coo_matrix


def construct_user_likes_matrix(users, likes, users_likes):
    users["user_row"] = range(len(users))  # Adding a counter to the users
    likes["like_row"] = range(len(likes))  # Adding a counter to the likes
    ul = merge(users_likes, likes[["likeid", "like_row"]], on="likeid")  # Adds the like counter to the users_likes table
    ul = merge(ul, users[["userid", "user_row"]], on="userid")  # Adds the users counter to the users_likes table
    ul_sparse = coo_matrix(([True] * len(ul), (ul["user_row"], ul["like_row"])),  # Required format for the constructor is (cell_values, (row_indices, column_indices)); despite its name, like_row holds the column index
                           shape=(len(users), len(likes)),  # Users are in the rows, likes are in the columns
                           dtype=bool)  # Every cell contains True or False / 1 or 0
    return ul_sparse


ul_sparse = construct_user_likes_matrix(users, likes, users_likes)
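
To see what the function above does, it can help to repeat the same steps on a tiny example. The following sketch uses three made-up users and two made-up likes (all table contents and the names `toy_users`, `toy_likes`, `toy_users_likes` and `like_col` are invented for illustration) and prints the resulting matrix in dense form, with one row per user and one column per like, as the `shape` argument above implies.

```python
import pandas as pd
from scipy.sparse import coo_matrix

# Made-up toy tables that reuse the column names of the real data
toy_users = pd.DataFrame({"userid": ["u1", "u2", "u3"]})
toy_likes = pd.DataFrame({"likeid": ["l1", "l2"], "name": ["Cats", "Dogs"]})
toy_users_likes = pd.DataFrame({"userid": ["u1", "u1", "u3"],
                                "likeid": ["l1", "l2", "l2"]})

# Give every user and every like a running number
toy_users["user_row"] = range(len(toy_users))
toy_likes["like_col"] = range(len(toy_likes))

# Attach both running numbers to every (user, like) pair
pairs = toy_users_likes.merge(toy_likes[["likeid", "like_col"]], on="likeid")
pairs = pairs.merge(toy_users[["userid", "user_row"]], on="userid")

# coo_matrix expects (values, (row_indices, column_indices))
toy_matrix = coo_matrix(
    ([True] * len(pairs), (pairs["user_row"].to_numpy(), pairs["like_col"].to_numpy())),
    shape=(len(toy_users), len(toy_likes)),
    dtype=bool,
)

toy_dense = toy_matrix.toarray().astype(int)
print(toy_dense)
```

User `u2` has no likes at all, so the second row of the printed matrix contains only zeros.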


# Data cleaning

Another tedious step is to clean the data. Typically, prediction models only work well when there is a lot of data. We already saw that there is a lot of missing data. Kosinski et al. propose a classical way: Throw out all the users that have not liked many likes and throw away all the likes that have not been liked by many users.

Again, it is not too instructive to understand all the details of the code. Still, you can try what happens if you change the parameters `min_likes` or `min_users`. How does it change the size of the resulting matrix and users table? What other ways could be used to deal with the problem that prediction models don't work that well with little data? What does this mean for the people being represented in the users data?

In [5]:
def trim_ul(ul_sparse, users, min_likes=150, min_users=50):
    ul_sparse = ul_sparse.tocsc()
    while True:
        i = sum(ul_sparse.shape)
        # sum(axis=1) returns a Users x 1 matrix with the number of likes per user; .T.A1 flattens it into a 1D array
        enough_likes = ul_sparse.sum(axis=1).T.A1 > min_likes
        ul_sparse = ul_sparse[enough_likes, :][:, ul_sparse.sum(axis=0).A1 > min_users]
        users = users.iloc[enough_likes]
        if i == sum(ul_sparse.shape):
            break
    return ul_sparse, users


ul_sparse_trimmed, users_trimmed = trim_ul(ul_sparse, users)  # This trims both users and likes to about one 15th


How many likes fit the criterion, and how many users are still present in the cleaned data? Can you find out how many likes a given user has? Or by how many users a like has been liked?
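
As a hint, here is a minimal sketch on a made-up matrix with three users and four likes (the matrix and the variable names are invented for illustration). It assumes, as in the code above, that users are in the rows and likes are in the columns. The same attribute and method calls should carry over to `ul_sparse_trimmed`.

```python
import numpy as np
from scipy.sparse import csc_matrix

# Made-up matrix: 3 users (rows) x 4 likes (columns)
toy = csc_matrix(np.array([[1, 1, 0, 1],
                           [0, 1, 0, 0],
                           [1, 1, 0, 0]], dtype=bool))

n_users, n_likes = toy.shape                          # size of the matrix
likes_per_user = np.asarray(toy.sum(axis=1)).ravel()  # one number per row, i.e. per user
users_per_like = np.asarray(toy.sum(axis=0)).ravel()  # one number per column, i.e. per like

print(n_users, n_likes)
print(likes_per_user)
print(users_per_like)
```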

In [None]:
# Add your own code



# Dimensionality reduction
The goal of this psychometric profiling is to predict some part of a person's personality from xer likes in a social network. There are a great many likes, but we want to predict just one or a few numbers (such as a person's openness). Many statistical models do not work well in a situation like this. One solution is to use a "dimensionality reduction". Here, this tries to find the combinations of likes that vary most across the data. For example, if most users liked "Cats", then liking "Cats" is not very informative and will be removed from the data. Instead, the algorithm tries to find combinations of likes on which the users disagree a lot. For example, half of the users liked "Dogs" and "Sausages", whereas the other half did not click these items.

Why might this be useful?

The library used for this is called scikit-learn and contains many methods for predictive statistics, a.k.a. machine learning. You can also try other techniques to reduce the dimension of the data, like other algorithms for [dimensionality reduction](https://scikit-learn.org/stable/modules/unsupervised_reduction.html) or [manifold learning](https://scikit-learn.org/stable/modules/manifold.html).


In [6]:
from sklearn.decomposition import TruncatedSVD

ul_lowdim_svd = TruncatedSVD(30).fit_transform(ul_sparse_trimmed)
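
To get a feeling for what the condensed data looks like, the following sketch applies `TruncatedSVD` to a small random matrix (the matrix, its size and the choice of 5 components are made up for illustration). It prints the shape of the result and the share of the variance that the components retain, which scikit-learn exposes as `explained_variance_ratio_`.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.decomposition import TruncatedSVD

# Made-up sparse matrix: 200 "users" x 50 "likes", roughly 10% of the cells filled
rng = np.random.default_rng(0)
toy = csr_matrix((rng.random((200, 50)) < 0.1).astype(float))

svd = TruncatedSVD(n_components=5, random_state=0)
toy_lowdim = svd.fit_transform(toy)

print(toy_lowdim.shape)  # one row per user, one column per component
explained = svd.explained_variance_ratio_.sum()
print(explained)         # share of the variance kept by the 5 components
```

On random data like this, the components keep only a small part of the variance, because there is no structure to find. On the real likes data the share may look different.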

# Training a model
Now that we have clean data in a condensed form, we can use a classical algorithm called Linear Regression to train a model that predicts a given output from the data. In our case, we want to predict the openness score from the condensed likes.

You can try other algorithms, like [Generalized Linear Models](https://scikit-learn.org/stable/modules/linear_model.html#linear-model) or [Support Vector Machines](https://scikit-learn.org/stable/modules/svm.html) or [Ensemble methods](https://scikit-learn.org/stable/modules/ensemble.html).
What part of reality is modeled by these algorithms? 

In [7]:
from sklearn.linear_model import LinearRegression

ope_lm = LinearRegression().fit(ul_lowdim_svd, users_trimmed["ope"])
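
Models in scikit-learn share the same `fit` and `predict` methods, so trying another algorithm usually means changing a single line. The following sketch illustrates this on synthetic data generated with `make_regression` (the data, the choice of `Ridge` and the value `alpha=1.0` are assumptions made for illustration, not part of the original tutorial).

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge

# Synthetic data: 300 samples with 30 features, standing in for the condensed likes
X, y = make_regression(n_samples=300, n_features=30, noise=10.0, random_state=0)

# Both models are trained and used in exactly the same way
models = {"linear": LinearRegression(), "ridge": Ridge(alpha=1.0)}
predictions = {}
for name, model in models.items():
    predictions[name] = model.fit(X, y).predict(X)
    print(name, predictions[name][:3])
```

To use this on the real data, replace `X` by `ul_lowdim_svd` and `y` by `users_trimmed["ope"]`.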

# Evaluating a model
How well are we performing with our model? To answer this question, we need to define some measure of quality. We might stick to statistical methods, like [Pearson's correlation](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.pearsonr.html). This function returns two values. The first is the correlation between the two data sets (-1 = perfect negative correlation, 0 = no correlation, 1 = perfect positive correlation). The second is a p-value: roughly, the probability that two uncorrelated data sets would produce a correlation at least this strong by chance.

In [8]:
from scipy.stats import pearsonr
pearsonr(ope_lm.predict(ul_lowdim_svd), users_trimmed["ope"])  # First value should be above .4

(0.4439544378352614, 0.0)
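
To get a feeling for the first value, the following sketch computes Pearson's correlation for two made-up extreme cases (the numbers are invented for illustration): one where the second list rises exactly in step with the first, and one where it falls exactly in step.

```python
from scipy.stats import pearsonr

x = [1, 2, 3, 4, 5]

r_up, p_up = pearsonr(x, [2, 4, 6, 8, 10])      # second list rises with x
r_down, p_down = pearsonr(x, [10, 8, 6, 4, 2])  # second list falls as x rises

print(r_up, r_down)
```

The first correlation is 1 and the second is -1, up to rounding. Real predictions typically land somewhere in between.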

Caveat: The model we have trained has already seen the data we are testing it on. Basically, we told it how high each user scored on openness and then asked whether it could remember this value. We are usually more interested in how it performs on data that it did not see during training. The code cell at the end of this section does this with a so-called train-test split, which sets part of the data aside for testing. With the linear model on the condensed data, the difference is negligible, but this would not be true for all models.

Try to play around:
- Use other models
- Predict other values
- Pick a different evaluation function
- Clean or condense the data differently
- ...

If you want to keep some result, you can add a new cell below by clicking on the `+` on top of this window.

So now you have learned more about the technical background. What does this tell us about Psychometric Profiling in our society? What are the assumptions in it? What world view does it convey? Who can use it?

In [11]:
from sklearn.model_selection import train_test_split
ul_train, ul_test, open_train, open_test = train_test_split(ul_lowdim_svd, users_trimmed["ope"],
                                                           test_size=0.1)
ope_lm = LinearRegression().fit(ul_train, open_train)
pearsonr(ope_lm.predict(ul_test), open_test)

(0.4598477290191918, 1.1568807436605519e-44)
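
For models that can memorize their training data, the gap between training and test performance can be large. The following sketch illustrates this with a decision tree on synthetic, noisy data (the data from `make_regression`, the noise level and the choice of `DecisionTreeRegressor` are assumptions made for illustration, not part of the original tutorial).

```python
from scipy.stats import pearsonr
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic noisy data standing in for the condensed likes and a personality score
X, y = make_regression(n_samples=500, n_features=30, noise=50.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# A fully grown decision tree can reproduce its training data exactly
tree = DecisionTreeRegressor(random_state=0).fit(X_train, y_train)

r_train, _ = pearsonr(tree.predict(X_train), y_train)  # evaluated on data seen during training
r_test, _ = pearsonr(tree.predict(X_test), y_test)     # evaluated on held-out data
print(r_train, r_test)
```

The correlation on the training data is close to 1, while the correlation on the held-out data is lower. Only the second number says something about how the model behaves on new users.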