# Readme

This analysis is based on data stored in this spreadsheet:

https://docs.google.com/spreadsheets/d/1QRv_0Amjex5fL6klhLXW8r2oSuihp3oaOnB2S5T-Wbk/edit?usp=sharing


We conducted a survey over 2020-2021 and received ~350 responses. Each respondent completed a traditional psychometric survey (the IPIP 50-item Big 5 survey) and also provided their usernames on four social media websites: LinkedIn, Twitter, StackOverflow and Reddit.

Public content on each of those four websites was scraped into a text string, which was then passed to IBM Watson's Personality Insights natural language processing tool (now deprecated). Personality Insights then assigned trait scores to each respondent based on their social media content. This blog post contains a useful summary of the standard NLP methodology: https://towardsdatascience.com/text-analytics-what-does-your-linkedin-profile-summary-say-about-your-personality-f80df46875d1

BIG5 Data Analysis.ipynb takes those two streams of data as inputs:
1) IPIP 50-item survey responses
2) IBM's Personality Insights scores

Data analysis process:
1) Data was imported from the Google Spreadsheet using the gspread package.
2) Demographic distributions were investigated. Most of our respondents were young (18-24), male, and Australian.
3) The IPIP 50-item scores were calculated from the survey responses, according to the scoring key at https://ipip.ori.org/new_ipip-50-item-scale.htm
4) The distributions of the IPIP 50-item scores were investigated. All the traits were normally distributed, but bunched in the upper range. No trait had scores below ~0.2.
5) IBM's Personality Insights scores for each respondent were imported from the same spreadsheet.
6) The distributions of the Personality Insights scores were investigated. Each trait was normally distributed, but again the range was restricted: all the scores fell in the 'average' zone of 0.4 - 0.6.
7) The two streams of data were compiled into one dataframe for further analysis. Separate dataframes were also created, filtered by word count and by score type (raw or percentile).
8) Correlation analysis: the correlations between the IPIP 50-item scores and the Personality Insights scores were investigated.
- Raw scores were investigated using Pearson's r
- Percentile scores were investigated using Spearman's rho
- Percentile scores and Spearman's rho were chosen as most appropriate for this analysis, because in personality analysis we are less concerned with the actual score and more with an individual's score relative to the cohort
9) Linear regression:
- We were interested in how well online content scores (independent variable, X) could approximate the IPIP 50-item scores (dependent variable, y)
- Multiple regressions were conducted with demographic variables (age, country, gender) included as independent variables. This was done to check the importance of demographics in estimating personality traits. None of the demographic variables was significant in any model.
- Simple regressions were conducted with no demographic variables included. This method resulted in much stronger and more predictive models (R-squared > 0.95 in most cases), and strongly significant Prob(F) scores (< 0.01 in all cases)
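
As a rough illustration of steps 3, 8 and 9 above, the following sketch scores a few Likert items, computes Spearman's rho, and fits a simple regression. It is not the notebook's code. The item columns (`q1` to `q4`), the scoring key, the sample data and the 0-1 rescaling are all assumptions made for the example, and `numpy.polyfit` stands in for whatever regression routine the notebook uses.

```python
import numpy as np
import pandas as pd

# Hypothetical scoring key, for illustration only: item column -> (trait, keying).
# The real 50-item key is the one published on the IPIP page linked in step 3.
EXAMPLE_KEY = {
    "q1": ("extraversion", +1),
    "q2": ("extraversion", -1),
    "q3": ("agreeableness", +1),
    "q4": ("agreeableness", -1),
}


def score_ipip(responses: pd.DataFrame, key: dict) -> pd.DataFrame:
    """Average 1-5 Likert answers per trait, reversing negatively keyed items."""
    items_by_trait = {}
    for item, (trait, keying) in key.items():
        answers = responses[item] if keying > 0 else 6 - responses[item]
        items_by_trait.setdefault(trait, []).append(answers)
    means = pd.DataFrame(
        {trait: pd.concat(cols, axis=1).mean(axis=1) for trait, cols in items_by_trait.items()}
    )
    # Assumption: rescale the 1-5 mean onto 0-1 so it is comparable to the other stream.
    return (means - 1) / 4


def spearman_rho(a: pd.Series, b: pd.Series) -> float:
    """Spearman's rho, computed as Pearson's r on the ranks of each series."""
    return a.rank().corr(b.rank())


def simple_regression(x: pd.Series, y: pd.Series) -> dict:
    """Least-squares fit of y = slope * x + intercept, with R-squared."""
    slope, intercept = np.polyfit(x, y, deg=1)
    residuals = y - (slope * x + intercept)
    r_squared = 1 - (residuals ** 2).sum() / ((y - y.mean()) ** 2).sum()
    return {"slope": slope, "intercept": intercept, "r_squared": r_squared}


# Made-up survey answers and text-derived scores for three respondents.
responses = pd.DataFrame({"q1": [5, 1, 3], "q2": [1, 5, 3], "q3": [4, 2, 3], "q4": [2, 4, 3]})
text_scores = pd.DataFrame({"extraversion": [0.6, 0.4, 0.5], "agreeableness": [0.4, 0.6, 0.5]})

survey_scores = score_ipip(responses, EXAMPLE_KEY)
percentiles = survey_scores.rank(pct=True)  # each respondent's standing within the cohort
rho = {t: spearman_rho(survey_scores[t], text_scores[t]) for t in survey_scores.columns}
fit = simple_regression(text_scores["extraversion"], survey_scores["extraversion"])
print(rho)
print(fit)
```

Because Spearman's rho depends only on rank order, it gives the same value for the raw scores and for their percentile ranks. The regression in this sketch includes an intercept; a model fitted without one reports an uncentered R-squared, which is not directly comparable.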


Steps to re-create data analysis:
1) Make sure you have the correct credentials (survey-personality-71154dfbe30a.json) stored in your working directory
2) Click 'Run all' in the ipynb file
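
To check the setup before running the notebook, a minimal sketch along these lines may help. It assumes the JSON file is a Google service account key and that the data sits in the first worksheet. Both are guesses, `worksheet_index` is a hypothetical parameter, and the notebook may connect differently.

```python
from pathlib import Path

# Values taken from this README; the worksheet index below is a hypothetical default.
SHEET_KEY = "1QRv_0Amjex5fL6klhLXW8r2oSuihp3oaOnB2S5T-Wbk"
CREDENTIALS = Path("survey-personality-71154dfbe30a.json")


def credentials_present(path: Path = CREDENTIALS) -> bool:
    """Return True if the credentials file exists at the given path."""
    return Path(path).is_file()


def load_records(sheet_key: str = SHEET_KEY, path: Path = CREDENTIALS, worksheet_index: int = 0):
    """Fetch one worksheet as a list of row dicts (needs network access and valid credentials)."""
    import gspread  # imported here so the check above works without gspread installed

    # Assumption: the JSON file is a Google service account key.
    client = gspread.service_account(filename=str(path))
    worksheet = client.open_by_key(sheet_key).get_worksheet(worksheet_index)
    return worksheet.get_all_records()
```

If `credentials_present()` returns `True`, calling `load_records()` should return one dict per spreadsheet row, which can be passed to `pandas.DataFrame` for inspection.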