# Analysis of the Stack Overflow Survey of 2018

This notebook analyzes the Stack Overflow survey of 2018.

The analysis follows the steps listed in the table of contents.

## Table of Contents

I.  [Which questions should be answered?](#Exploratory-Data-Analysis)<br>
II. [Clean the data](#Clean)<br>
III. [User-User Based Collaborative Filtering](#User-User)<br>
IV. [Content Based Recommendations (EXTRA - NOT REQUIRED)](#Content-Recs)<br>
V. [Matrix Factorization](#Matrix-Fact)<br>
VI. [Extras & Concluding](#conclusions)

(gather, assess, clean, analyze, model, visualize) `TODO`

In [75]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

%matplotlib inline

df = pd.read_csv('../data/survey_results_public.csv')
df_schema = pd.read_csv('../data/survey_results_schema.csv')

# Show df to get an idea of the data
df.head()

Unnamed: 0,Respondent,Professional,ProgramHobby,Country,University,EmploymentStatus,FormalEducation,MajorUndergrad,HomeRemote,CompanySize,...,StackOverflowMakeMoney,Gender,HighestEducationParents,Race,SurveyLong,QuestionsInteresting,QuestionsConfusing,InterestedAnswers,Salary,ExpectedSalary
0,1,Student,"Yes, both",United States,No,"Not employed, and not looking for work",Secondary school,,,,...,Strongly disagree,Male,High school,White or of European descent,Strongly disagree,Strongly agree,Disagree,Strongly agree,,
1,2,Student,"Yes, both",United Kingdom,"Yes, full-time",Employed part-time,Some college/university study without earning ...,Computer science or software engineering,"More than half, but not all, the time",20 to 99 employees,...,Strongly disagree,Male,A master's degree,White or of European descent,Somewhat agree,Somewhat agree,Disagree,Strongly agree,,37500.0
2,3,Professional developer,"Yes, both",United Kingdom,No,Employed full-time,Bachelor's degree,Computer science or software engineering,"Less than half the time, but at least one day ...","10,000 or more employees",...,Disagree,Male,A professional degree,White or of European descent,Somewhat agree,Agree,Disagree,Agree,113750.0,
3,4,Professional non-developer who sometimes write...,"Yes, both",United States,No,Employed full-time,Doctoral degree,A non-computer-focused engineering discipline,"Less than half the time, but at least one day ...","10,000 or more employees",...,Disagree,Male,A doctoral degree,White or of European descent,Agree,Agree,Somewhat agree,Strongly agree,,
4,5,Professional developer,"Yes, I program as a hobby",Switzerland,No,Employed full-time,Master's degree,Computer science or software engineering,Never,10 to 19 employees,...,,,,,,,,,,


In [76]:
df_schema.head()

Unnamed: 0,Column,Question
0,Respondent,Respondent ID number
1,Professional,Which of the following best describes you?
2,ProgramHobby,Do you program as a hobby or contribute to ope...
3,Country,In which country do you currently live?
4,University,"Are you currently enrolled in a formal, degree..."


In [77]:
assert df.shape[1] == df_schema.shape[0]
print('The survey contained {} questions. In total, there are {} survey responses.'.format(df.shape[1], df.shape[0]))

The survey contained 154 questions. In total, there are 19102 survey responses.


### <a class="anchor" id="Exploratory-Data-Analysis">Part I : Which questions should be answered?</a>

The schema and the data are used to find interesting questions, which are answered in the following analysis.

In [78]:
df_schema.head(n=10)

Unnamed: 0,Column,Question
0,Respondent,Respondent ID number
1,Professional,Which of the following best describes you?
2,ProgramHobby,Do you program as a hobby or contribute to ope...
3,Country,In which country do you currently live?
4,University,"Are you currently enrolled in a formal, degree..."
5,EmploymentStatus,Which of the following best describes your cur...
6,FormalEducation,Which of the following best describes the high...
7,MajorUndergrad,Which of the following best describes your mai...
8,HomeRemote,How often do you work from home or remotely?
9,CompanySize,"In terms of the number of employees, how large..."


**Questions which will be answered in the following parts:**

|#| Question | Additional Information | Helpful columns | Target column |
| ---| :--- | :---| :---| :---|
|1| Does the company influence the happiness/satisfaction of the users?  | only for employed users of a company (EmploymentStatus) | EmploymentStatus, CompanySize, CompanyType, InfluenceInternet, InfluenceWorkstation, InfluenceHardware, InfluenceServers, InfluenceTechStack, InfluenceDeptTech, InfluenceVizTools, InfluenceDatabase, InfluenceCloud, InfluenceConsultants, InfluenceRecruitment, InfluenceCommunication | CareerSatisfaction, JobSatisfaction  |
|2| Is there a correlation between "Overpaid" and "Salary", depending on the experience?  | - | YearsProgram, Overpaid, Salary | *TODO: Maybe use one to predict the other?*|
|3| Is there a programming-language-specific correlation between "OtherPeoplesCode - Maintaining other people's code is a form of torture" and "EnjoyDebugging - I enjoy debugging code"?| - | HaveWorkedLanguage, OtherPeoplesCode, EnjoyDebugging | *TODO: Maybe use one to predict the other?*|
|4|How many people who program in Python follow the PEP8 guidelines and use spaces instead of tabs?|-|HaveWorkedLanguage |TabsSpaces|

These questions were identified by inspecting `df_schema` in detail.
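
Because `df_schema.head()` truncates long question texts, a small lookup helper can make the schema easier to read. The following is a minimal sketch: the helper name `get_question` and the two-row toy schema are assumptions for illustration, and in the notebook `df_schema` would be passed instead.

```python
import pandas as pd

# Toy stand-in for df_schema (hypothetical rows, same column layout as the schema file)
toy_schema = pd.DataFrame({
    "Column": ["Respondent", "Country"],
    "Question": ["Respondent ID number", "In which country do you currently live?"],
})

def get_question(column_name, schema):
    """Return the full question text for a survey column (hypothetical helper)."""
    matches = schema.loc[schema["Column"] == column_name, "Question"]
    if matches.empty:
        raise KeyError(f"Column {column_name!r} not found in schema")
    return matches.iloc[0]

print(get_question("Country", toy_schema))
```

Raising a `KeyError` for unknown columns is a design choice of this sketch; returning `None` would work as well.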

### <a class="anchor" id="Clean">Part II: Clean the data</a>

In [Part I](#Exploratory-Data-Analysis), the needed columns were defined. In the following, the data is prepared for each question. Only the needed columns are modified and cleaned.

In [227]:
df_q1 = df[['Respondent','EmploymentStatus', 'CompanySize', 'CompanyType', 'InfluenceInternet', 'InfluenceWorkstation', 'InfluenceHardware', 
        'InfluenceServers', 'InfluenceTechStack', 'InfluenceDeptTech', 'InfluenceVizTools', 
        'InfluenceDatabase', 'InfluenceCloud', 'InfluenceConsultants', 'InfluenceRecruitment', 
        'InfluenceCommunication', 'CareerSatisfaction', 'JobSatisfaction']]
df_q2 = df[['YearsProgram', 'Overpaid', 'Salary']]
df_q3 = df[['HaveWorkedLanguage', 'OtherPeoplesCode', 'EnjoyDebugging']]
df_q4 = df[['HaveWorkedLanguage', 'TabsSpaces']]

#### Cleaning for question 1: Does the company influence the happiness/satisfaction of the users?

In [88]:
df_q1.head()

Unnamed: 0,EmploymentStatus,CompanySize,CompanyType,InfluenceInternet,InfluenceWorkstation,InfluenceHardware,InfluenceServers,InfluenceTechStack,InfluenceDeptTech,InfluenceVizTools,InfluenceDatabase,InfluenceCloud,InfluenceConsultants,InfluenceRecruitment,InfluenceCommunication,CareerSatisfaction,JobSatisfaction
0,"Not employed, and not looking for work",,,Not very satisfied,,,,,,,,,,,,,
1,Employed part-time,20 to 99 employees,"Privately-held limited company, not in startup...",Satisfied,No influence at all,No influence at all,No influence at all,No influence at all,No influence at all,No influence at all,No influence at all,No influence at all,No influence at all,No influence at all,No influence at all,,
2,Employed full-time,"10,000 or more employees",Publicly-traded corporation,Very satisfied,A lot of influence,Some influence,Some influence,Some influence,A lot of influence,Some influence,Some influence,Some influence,Some influence,Some influence,Some influence,8.0,9.0
3,Employed full-time,"10,000 or more employees",Non-profit/non-governmental organization or pr...,,,,,,,,,,,,,6.0,3.0
4,Employed full-time,10 to 19 employees,"Privately-held limited company, not in startup...",Satisfied,,,,,,,,,,,,6.0,8.0


In [89]:
# EMPLOYMENT STATUS - Question 1 is only relevant for people who work in a company
print('Answer possibilities for EmploymentStatus: ',df_q1.EmploymentStatus.unique())
df_q1 = df_q1[df_q1.EmploymentStatus.isin(['Employed full-time', 'Employed part-time'])]
df_q1 = df_q1.drop(labels = ['EmploymentStatus'],axis = 1)
print('The number of survey responses was reduced from {} to {}.'.format(df.shape[0], df_q1.shape[0]))

Answer possibilities for EmploymentStatus:  ['Not employed, and not looking for work' 'Employed part-time'
 'Employed full-time'
 'Independent contractor, freelancer, or self-employed'
 'Not employed, but looking for work' 'I prefer not to say' 'Retired']
The number of survey responses was reduced from 19102 to 14823.


In [118]:
# COMPANY SIZE
print('Answer possibilities for CompanySize: ',df_q1.CompanySize.unique())
# remove entries which do not provide additional information
df_q1.CompanySize = df_q1.CompanySize.replace(["I don't know", "I prefer not to answer"], np.nan)
df_q1 = df_q1.dropna(axis = 0, subset=['CompanySize'])
#print(df_q1.shape)
#df_q1 = pd.concat([df_q1, pd.get_dummies(df_q1.CompanySize, prefix='CompanySize')], ignore_index = True)
tmp = pd.get_dummies(df_q1.CompanySize, prefix='CompanySize')
#print(tmp.shape)
df_q1[tmp.columns] = tmp
#print(df_q1.shape)
df_q1 = df_q1.drop(labels=['CompanySize'], axis = 1)
#print(df_q1.columns)

Answer possibilities for CompanySize:  [nan '20 to 99 employees' '10,000 or more employees' '10 to 19 employees'
 'Fewer than 10 employees' '5,000 to 9,999 employees'
 '100 to 499 employees' '1,000 to 4,999 employees' '500 to 999 employees'
 "I don't know" 'I prefer not to answer']


In [121]:
# COMPANY TYPE
print('Answer possibilities for CompanyType: ',df_q1.CompanyType.unique())
# remove entries which do not provide additional information
df_q1.CompanyType = df_q1.CompanyType.replace(["I don't know", "I prefer not to answer"], np.nan)
df_q1 = df_q1.dropna(axis = 0, subset=['CompanyType'])
#print(df_q1.shape)
# df_q1 = pd.concat([df_q1, pd.get_dummies(df_q1.CompanyType, prefix='CompanyType')])
tmp = pd.get_dummies(df_q1.CompanyType, prefix='CompanyType')
#print(tmp.shape)
df_q1[tmp.columns] = tmp
#print(df_q1.shape)
df_q1 = df_q1.drop(labels=['CompanyType'], axis = 1)
#print(df_q1.columns)

AttributeError: 'DataFrame' object has no attribute 'CompanyType'
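
The `AttributeError` above most likely stems from running the cell a second time, after `CompanyType` had already been replaced by its dummy columns. A sketch of a re-runnable variant follows. The helper name `one_hot_once` and the toy data are assumptions, not part of the original notebook.

```python
import numpy as np
import pandas as pd

def one_hot_once(frame, column, uninformative=()):
    """One-hot encode `column`; return the frame unchanged if it was already encoded."""
    if column not in frame.columns:
        return frame  # already processed in an earlier run
    frame = frame.copy()
    if len(uninformative) > 0:
        # Treat answers without information content as missing
        frame[column] = frame[column].replace(list(uninformative), np.nan)
    frame = frame.dropna(subset=[column])
    dummies = pd.get_dummies(frame[column], prefix=column)
    return pd.concat([frame.drop(columns=[column]), dummies], axis=1)

# Made-up rows for illustration
toy = pd.DataFrame({
    "CompanyType": ["Startup", "I don't know", None, "Startup", "Government"],
    "JobSatisfaction": [8.0, 5.0, 7.0, 6.0, 9.0],
})
once = one_hot_once(toy, "CompanyType", uninformative=["I don't know"])
twice = one_hot_once(once, "CompanyType", uninformative=["I don't know"])
print(twice.columns.tolist())
```

Calling the helper a second time returns the already encoded frame instead of raising an error.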

In [122]:
# INFLUENCE
influence_columns = ['InfluenceInternet', 'InfluenceWorkstation', 'InfluenceHardware', 
        'InfluenceServers', 'InfluenceTechStack', 'InfluenceDeptTech', 'InfluenceVizTools', 
        'InfluenceDatabase', 'InfluenceCloud', 'InfluenceConsultants', 'InfluenceRecruitment', 
        'InfluenceCommunication']
for column in influence_columns:
    df_q1 = df_q1.dropna(axis = 0, subset=[column])
    tmp = pd.get_dummies(df_q1[column], prefix=column)
    #print(tmp.shape)
    df_q1[tmp.columns] = tmp
    #print(df_q1.shape)
    df_q1 = df_q1.drop(labels=[column], axis = 1)

    
    #df_q1 = df_q1.dropna(axis = 0, subset=[column])
    #df_q1 = pd.concat([df_q1, pd.get_dummies(df_q1[column], prefix=column)])
    #df_q1 = df_q1.drop(labels=[column], axis = 1)
#unique_values


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self[k1] = value[k2]
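
This warning is typically raised when a frame that was created by selecting columns from another frame, as `df_q1` was from `df`, is modified in place afterwards. One common way to avoid it is to take an explicit copy and to build the dummy columns in a single `pd.get_dummies` call. The sketch below uses a small made-up frame; only the column names are taken from the notebook.

```python
import pandas as pd

# Made-up rows for illustration
toy = pd.DataFrame({
    "Respondent": [1, 2, 3, 4],
    "InfluenceInternet": ["Some influence", None, "A lot of influence", "Some influence"],
    "InfluenceCloud": ["No influence at all", "Some influence", "Some influence", None],
    "JobSatisfaction": [7.0, 8.0, 9.0, 6.0],
})
influence_columns = ["InfluenceInternet", "InfluenceCloud"]

# An explicit copy makes the subset independent of the original frame
subset = toy[["InfluenceInternet", "InfluenceCloud", "JobSatisfaction"]].copy()

# Drop rows with a missing answer in any influence column, then encode all columns at once
subset = subset.dropna(subset=influence_columns)
encoded = pd.get_dummies(subset, columns=influence_columns)
print(encoded.columns.tolist())
```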


In [127]:
# JOB SATISFACTION
mean = int(df_q1.JobSatisfaction.mean())
# as the satisfaction is on a scale, the mean has been converted to an integer as well
df_q1.JobSatisfaction = df_q1.JobSatisfaction.fillna(mean)

In [129]:
# CAREER SATISFACTION
mean = int(df_q1.CareerSatisfaction.mean())
# as the satisfaction is on a scale, the mean has been converted to an integer as well
df_q1.CareerSatisfaction = df_q1.CareerSatisfaction.fillna(mean)
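
Note that `int()` truncates the fractional part, so a mean of 6.75 is imputed as 6. If the nearest scale value is preferred, `round()` can be used instead. This is a suggestion rather than what the notebook does, and the values below are made up.

```python
import numpy as np
import pandas as pd

# Toy satisfaction scores with one missing value; the mean of the known values is 6.75
satisfaction = pd.Series([6.0, 7.0, np.nan, 7.0, 7.0])

truncated = int(satisfaction.mean())       # drops the fractional part
rounded = int(round(satisfaction.mean()))  # nearest integer

filled = satisfaction.fillna(rounded)
print(truncated, rounded)
```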

#### Cleaning for question 2: Is there a correlation between "Overpaid" and "Salary"?

In [138]:
# The calculation only makes sense if both values are available
df_q2 = df_q2.dropna(subset = ['Overpaid', 'Salary'])
print('The number of survey respondents was reduced from {} to {}.'.format(df.shape[0], df_q2.shape[0]))

The number of survey respondents was reduced from 19102 to 4995.


In [145]:
# YEARS PROGRAM
print('Answer possibilities for YearsProgram: ',df_q2.YearsProgram.unique())
# remove entries which do not provide additional information
df_q2 = df_q2.dropna(axis = 0, subset=['YearsProgram'])
print(df_q2.shape)
tmp = pd.get_dummies(df_q2.YearsProgram, prefix='YearsProgram')
print(tmp.shape)
df_q2[tmp.columns] = tmp
print(df_q2.shape)
df_q2 = df_q2.drop(labels=['YearsProgram'], axis = 1)

Answer possibilities for YearsProgram:  ['20 or more years' '2 to 3 years' '10 to 11 years' '7 to 8 years'
 '4 to 5 years' '8 to 9 years' '11 to 12 years' '3 to 4 years'
 '5 to 6 years' '9 to 10 years' '17 to 18 years' '14 to 15 years'
 '1 to 2 years' '16 to 17 years' '12 to 13 years' '18 to 19 years'
 '6 to 7 years' '15 to 16 years' '13 to 14 years' '19 to 20 years'
 'Less than a year']
(4989, 24)
(4989, 21)
(4989, 24)


In [147]:
# OVERPAID
print('Answer possibilities for Overpaid: ',df_q2.Overpaid.unique())
# remove entries which do not provide additional information
df_q2 = df_q2.dropna(axis = 0, subset=['Overpaid'])
print(df_q2.shape)
tmp = pd.get_dummies(df_q2.Overpaid, prefix='Overpaid')
print(tmp.shape)
df_q2[tmp.columns] = tmp
print(df_q2.shape)
df_q2 = df_q2.drop(labels=['Overpaid'], axis = 1)

Answer possibilities for Overpaid:  ['Neither underpaid nor overpaid' 'Somewhat underpaid' 'Somewhat overpaid'
 'Greatly underpaid' 'Greatly overpaid']
(4989, 28)
(4989, 5)
(4989, 28)
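
Since the answers to `Overpaid` are ordered, an alternative to one-hot encoding is an ordinal mapping, which allows a rank correlation with `Salary` to be computed directly. This is only a sketch of one possible approach: the numeric codes and the toy salaries are assumptions, and the Spearman coefficient is computed here as the Pearson correlation of the ranks.

```python
import pandas as pd

# Answer options as printed above; the numeric codes are an arbitrary choice
overpaid_order = {
    "Greatly underpaid": -2,
    "Somewhat underpaid": -1,
    "Neither underpaid nor overpaid": 0,
    "Somewhat overpaid": 1,
    "Greatly overpaid": 2,
}

toy = pd.DataFrame({
    "Overpaid": ["Greatly underpaid", "Somewhat underpaid", "Neither underpaid nor overpaid",
                 "Somewhat overpaid", "Greatly overpaid"],
    "Salary": [30000.0, 45000.0, 60000.0, 80000.0, 120000.0],  # made-up values
})

overpaid_code = toy["Overpaid"].map(overpaid_order)

# Spearman rank correlation = Pearson correlation of the ranks
spearman = overpaid_code.rank().corr(toy["Salary"].rank())
print(round(spearman, 3))
```

The same mapping idea could be applied to `YearsProgram` if the experience level should enter the analysis as a single ordered variable.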


#### Cleaning for question 3: Is there a correlation between "OtherPeoplesCode - Maintaining other people's code is a form of torture" and "EnjoyDebugging - I enjoy debugging code"?

In [228]:
# The calculation only makes sense if 'OtherPeoplesCode' and 'EnjoyDebugging' are both available; rows with any missing value are dropped
df_q3 = df_q3.dropna(how = 'any')
df_q3 = df_q3.reset_index(drop = True)
print('The number of survey respondents was reduced from {} to {}.'.format(df.shape[0], df_q3.shape[0]))

The number of survey respondents was reduced from 19102 to 10103.


In [229]:
df_q3.head()

Unnamed: 0,HaveWorkedLanguage,OtherPeoplesCode,EnjoyDebugging
0,JavaScript; Python; Ruby; SQL,Disagree,Agree
1,Java; PHP; Python,Disagree,Agree
2,Matlab; Python; R; SQL,Agree,Somewhat agree
3,CoffeeScript; Clojure; Elixir; Erlang; Haskell...,Somewhat agree,Agree
4,C#; JavaScript,Somewhat agree,Agree


In [203]:

# create the user-article matrix with 1's and 0's

def create_user_item_matrix(df):
    '''
    INPUT:
    df - pandas dataframe with article_id, title, user_id columns

    OUTPUT:
    user_item - user item matrix

    Description:
    Return a matrix with user ids as rows and article ids on the columns with 1 values where a user interacted with
    an article and a 0 otherwise
    '''
    user_ids = df.user_id.unique()
    article_ids = df.article_id.unique()

    # Start with an all-zero matrix: one row per user, one column per article
    user_item = pd.DataFrame(data = np.zeros((len(user_ids), len(article_ids))), index = user_ids, columns = article_ids)

    for user_id, user_article_ids in df.groupby('user_id')['article_id']:
        # It does not matter how often a user has interacted with an article
        user_item.loc[user_id, user_article_ids.unique()] = 1

    return user_item # return the user_item matrix

user_item = create_user_item_matrix(df)

In [186]:
tmp = df_q3

In [197]:
(df_q3.HaveWorkedLanguage == '[]').sum()

0

In [230]:
# HAVE WORKED LANGUAGE - Extract available values
# convert all 'HaveWorkedLanguage' entries into lists: df_q3.HaveWorkedLanguage.str.split(';')
df_q3['HaveWorkedLanguageList'] = df_q3.HaveWorkedLanguage.str.split(';')
# flatten all lists into one set of unique language names
prog_languages = {elem.strip() for entry in df_q3.HaveWorkedLanguageList for elem in entry}
# convert available programming languages in columns
print(df_q3.columns)
df_q3[list(prog_languages)] = pd.DataFrame(data = np.zeros((df_q3.shape[0],len(prog_languages))), columns = list(prog_languages))
print(df_q3.columns)

Index(['HaveWorkedLanguage', 'OtherPeoplesCode', 'EnjoyDebugging',
       'HaveWorkedLanguageList'],
      dtype='object')
Index(['HaveWorkedLanguage', 'OtherPeoplesCode', 'EnjoyDebugging',
       'HaveWorkedLanguageList', 'Erlang', 'Visual Basic 6', 'Haskell', 'Perl',
       'Go', 'Matlab', 'Clojure', 'VB.NET', 'C#', 'C', 'CoffeeScript', 'Ruby',
       'Python', 'R', 'PHP', 'Lua', 'Rust', 'Assembly', 'Hack', 'VBA', 'Swift',
       'F#', 'Dart', 'Julia', 'Java', 'C++', 'Smalltalk', 'JavaScript',
       'Common Lisp', 'TypeScript', 'SQL', 'Groovy', 'Scala', 'Objective-C',
       'Elixir'],
      dtype='object')


In [231]:
for idx, entry in df_q3.HaveWorkedLanguageList.items():
    print(idx)
    #print(entry)
    df_q3.loc[df_q3.index[idx], entry[0]] = 1
    #df_q3.iloc[idx,entry[0]] = 1
    break
    
print(df_q3.JavaScript)
    #[prog_languages.add(elem.strip()) for elem in entry]

0
0        1.0
1        0.0
2        0.0
3        0.0
4        0.0
5        0.0
6        0.0
7        0.0
8        0.0
9        0.0
10       0.0
11       0.0
12       0.0
13       0.0
14       0.0
15       0.0
16       0.0
17       0.0
18       0.0
19       0.0
20       0.0
21       0.0
22       0.0
23       0.0
24       0.0
25       0.0
26       0.0
27       0.0
28       0.0
29       0.0
        ... 
10073    0.0
10074    0.0
10075    0.0
10076    0.0
10077    0.0
10078    0.0
10079    0.0
10080    0.0
10081    0.0
10082    0.0
10083    0.0
10084    0.0
10085    0.0
10086    0.0
10087    0.0
10088    0.0
10089    0.0
10090    0.0
10091    0.0
10092    0.0
10093    0.0
10094    0.0
10095    0.0
10096    0.0
10097    0.0
10098    0.0
10099    0.0
10100    0.0
10101    0.0
10102    0.0
Name: JavaScript, Length: 10103, dtype: float64
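
The loop above only sets the first language of the first respondent and then stops. A more compact way to obtain one indicator column per language is `Series.str.get_dummies`, sketched below on a few of the rows shown in `df_q3.head()`. It assumes that the entries are separated by `'; '`, as in that output.

```python
import pandas as pd

# A few rows in the format of df_q3.head()
toy = pd.DataFrame({
    "HaveWorkedLanguage": ["JavaScript; Python; Ruby; SQL", "Java; PHP; Python", "C#; JavaScript"],
    "EnjoyDebugging": ["Agree", "Agree", "Agree"],
})

# One 0/1 column per language; each respondent can have several 1s
language_flags = toy["HaveWorkedLanguage"].str.get_dummies(sep="; ")
toy_encoded = pd.concat([toy, language_flags], axis=1)
print(toy_encoded[["JavaScript", "Python"]])
```

Whole entries are matched, so `Java` and `JavaScript` end up in separate columns.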


#### Cleaning for question 4: How many people who program in Python follow the PEP8 guidelines and use spaces instead of tabs?
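
No cleaning code has been written for this question yet. One possible starting point, sketched under the assumption that `HaveWorkedLanguage` is `'; '`-separated as in the output for question 3, is to keep the respondents who list Python and to compute the share of each `TabsSpaces` answer. The toy rows and the answer labels `Tabs`, `Spaces` and `Both` are made up for illustration.

```python
import pandas as pd

# Made-up rows for illustration
toy = pd.DataFrame({
    "HaveWorkedLanguage": ["JavaScript; Python", "Java; PHP", "Python", "C; Python; SQL", None],
    "TabsSpaces": ["Spaces", "Tabs", "Spaces", "Tabs", "Both"],
})

toy = toy.dropna(subset=["HaveWorkedLanguage", "TabsSpaces"])

# Split into a list per respondent so that only exact language names match
uses_python = toy["HaveWorkedLanguage"].str.split("; ").apply(lambda langs: "Python" in langs)

# Share of each answer among the respondents who list Python
python_share = toy.loc[uses_python, "TabsSpaces"].value_counts(normalize=True)
print(python_share)
```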


In [None]:
def get_top_articles(n, df=df):
    '''
    INPUT:
    n - (int) the number of top articles to return
    df - (pandas dataframe) df as defined at the top of the notebook 
    
    OUTPUT:
    top_articles - (list) A list of the top 'n' article titles 
    
    '''
    # Your code here
    
    return top_articles # Return the top article titles from df (not df_content)

def get_top_article_ids(n, df=df):
    '''
    INPUT:
    n - (int) the number of top articles to return
    df - (pandas dataframe) df as defined at the top of the notebook 
    
    OUTPUT:
    top_articles - (list) A list of the top 'n' article ids 
    
    '''
    # Your code here
 
    return top_articles # Return the top article ids

In [None]:
print(get_top_articles(10))
print(get_top_article_ids(10))

In [None]:
# Test your function by returning the top 5, 10, and 20 articles
top_5 = get_top_articles(5)
top_10 = get_top_articles(10)
top_20 = get_top_articles(20)

# Test each of your three lists from above
t.sol_2_test(get_top_articles)

### <a class="anchor" id="User-User">Part III: User-User Based Collaborative Filtering</a>


`1.` Use the function below to reformat the **df** dataframe to be shaped with users as the rows and articles as the columns.  

* Each **user** should only appear in each **row** once.


* Each **article** should only show up in one **column**.  


* **If a user has interacted with an article, then place a 1 where the user-row meets for that article-column**.  It does not matter how many times a user has interacted with the article, all entries where a user has interacted with an article should be a 1.  


* **If a user has not interacted with an item, then place a zero where the user-row meets for that article-column**. 

Use the tests to make sure the basic structure of your matrix matches what is expected by the solution.
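
The requirements above can also be met without an explicit loop. The following sketch uses `pd.crosstab` on made-up interaction data and assumes `user_id` and `article_id` columns. It illustrates the idea and is not the solution expected by the tests.

```python
import pandas as pd

# Made-up interactions; user 1 has viewed article "10.0" twice
interactions = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 3, 3],
    "article_id": ["10.0", "10.0", "20.0", "20.0", "10.0", "30.0"],
})

# Count interactions per user/article pair, then reduce every positive count to 1
counts = pd.crosstab(interactions["user_id"], interactions["article_id"])
user_item_toy = (counts > 0).astype(int)
print(user_item_toy)
```

Each user appears in exactly one row and each article in exactly one column, and repeated interactions still yield a single 1.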

In [None]:
# create the user-article matrix with 1's and 0's

def create_user_item_matrix(df):
    '''
    INPUT:
    df - pandas dataframe with article_id, title, user_id columns
    
    OUTPUT:
    user_item - user item matrix 
    
    Description:
    Return a matrix with user ids as rows and article ids on the columns with 1 values where a user interacted with 
    an article and a 0 otherwise
    '''
    # Fill in the function here
    
    return user_item # return the user_item matrix 

user_item = create_user_item_matrix(df)

In [None]:
## Tests: You should just need to run this cell.  Don't change the code.
assert user_item.shape[0] == 5149, "Oops!  The number of users in the user-article matrix doesn't look right."
assert user_item.shape[1] == 714, "Oops!  The number of articles in the user-article matrix doesn't look right."
assert user_item.sum(axis=1)[1] == 36, "Oops!  The number of articles seen by user 1 doesn't look right."
print("You have passed our quick tests!  Please proceed!")

`2.` Complete the function below which should take a user_id and provide an ordered list of the most similar users to that user (from most similar to least similar).  The returned result should not contain the provided user_id, as we know that each user is similar to him/herself. Because the results for each user here are binary, it (perhaps) makes sense to compute similarity as the dot product of two users. 

Use the tests to test your function.
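
For binary rows, the dot product of two users counts the articles both have interacted with. A minimal sketch of this similarity ranking on a made-up matrix is shown below. The function name `find_similar_users_sketch` is hypothetical, and the order among users with equal similarity is not specified here.

```python
import pandas as pd

# Made-up binary user-item matrix
user_item_toy = pd.DataFrame(
    [[1, 1, 0, 0],
     [1, 1, 1, 0],
     [0, 0, 1, 1],
     [1, 0, 0, 0]],
    index=[1, 2, 3, 4],                        # user ids
    columns=["10.0", "20.0", "30.0", "40.0"],  # article ids
)

def find_similar_users_sketch(user_id, user_item):
    """Return the other user ids, ordered by dot-product similarity (largest first)."""
    # Dot product of every user row with the row of the given user
    similarity = user_item.dot(user_item.loc[user_id])
    # A user is trivially similar to themselves, so drop the own id
    similarity = similarity.drop(user_id)
    return similarity.sort_values(ascending=False).index.tolist()

print(find_similar_users_sketch(1, user_item_toy))
```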

In [None]:
def find_similar_users(user_id, user_item=user_item):
    '''
    INPUT:
    user_id - (int) a user_id
    user_item - (pandas dataframe) matrix of users by articles: 
                1's when a user has interacted with an article, 0 otherwise
    
    OUTPUT:
    similar_users - (list) an ordered list where the closest users (largest dot product users)
                    are listed first
    
    Description:
    Computes the similarity of every pair of users based on the dot product
    Returns an ordered list of user ids, from most to least similar
    
    '''
    # compute similarity of each user to the provided user

    # sort by similarity

    # create list of just the ids
   
    # remove the own user's id
       
    return most_similar_users # return a list of the users in order from most to least similar
        

In [None]:
# Do a spot check of your function
print("The 10 most similar users to user 1 are: {}".format(find_similar_users(1)[:10]))
print("The 5 most similar users to user 3933 are: {}".format(find_similar_users(3933)[:5]))
print("The 3 most similar users to user 46 are: {}".format(find_similar_users(46)[:3]))

`3.` Now that you have a function that provides the most similar users to each user, you will want to use these users to find articles you can recommend.  Complete the functions below to return the articles you would recommend to each user. 

In [None]:
def get_article_names(article_ids, df=df):
    '''
    INPUT:
    article_ids - (list) a list of article ids
    df - (pandas dataframe) df as defined at the top of the notebook
    
    OUTPUT:
    article_names - (list) a list of article names associated with the list of article ids 
                    (this is identified by the title column)
    '''
    # Your code here
    
    return article_names # Return the article names associated with list of article ids


def get_user_articles(user_id, user_item=user_item):
    '''
    INPUT:
    user_id - (int) a user id
    user_item - (pandas dataframe) matrix of users by articles: 
                1's when a user has interacted with an article, 0 otherwise
    
    OUTPUT:
    article_ids - (list) a list of the article ids seen by the user
    article_names - (list) a list of article names associated with the list of article ids 
                    (this is identified by the doc_full_name column in df_content)
    
    Description:
    Provides a list of the article_ids and article titles that have been seen by a user
    '''
    # Your code here
    
    return article_ids, article_names # return the ids and names


def user_user_recs(user_id, m=10):
    '''
    INPUT:
    user_id - (int) a user id
    m - (int) the number of recommendations you want for the user
    
    OUTPUT:
    recs - (list) a list of recommendations for the user
    
    Description:
    Loops through the users based on closeness to the input user_id
    For each user - finds articles the user hasn't seen before and provides them as recs
    Does this until m recommendations are found
    
    Notes:
    Users who are the same closeness are chosen arbitrarily as the 'next' user
    
    For the user where the number of recommended articles starts below m 
    and ends exceeding m, the last items are chosen arbitrarily
    
    '''
    # Your code here
    
    return recs # return your recommendations for this user_id    

In [None]:
# Check Results
get_article_names(user_user_recs(1, 10)) # Return 10 recommendations for user 1

In [None]:
# Test your functions here - No need to change this code - just run this cell
assert set(get_article_names(['1024.0', '1176.0', '1305.0', '1314.0', '1422.0', '1427.0'])) == set(['using deep learning to reconstruct high-resolution audio', 'build a python app on the streaming analytics service', 'gosales transactions for naive bayes model', 'healthcare python streaming application demo', 'use r dataframes & ibm watson natural language understanding', 'use xgboost, scikit-learn & ibm watson machine learning apis']), "Oops! Your get_article_names function doesn't work quite how we expect."
assert set(get_article_names(['1320.0', '232.0', '844.0'])) == set(['housing (2015): united states demographic measures','self-service data preparation with ibm data refinery','use the cloudant-spark connector in python notebook']), "Oops! Your get_article_names function doesn't work quite how we expect."
assert set(get_user_articles(20)[0]) == set(['1320.0', '232.0', '844.0'])
assert set(get_user_articles(20)[1]) == set(['housing (2015): united states demographic measures', 'self-service data preparation with ibm data refinery','use the cloudant-spark connector in python notebook'])
assert set(get_user_articles(2)[0]) == set(['1024.0', '1176.0', '1305.0', '1314.0', '1422.0', '1427.0'])
assert set(get_user_articles(2)[1]) == set(['using deep learning to reconstruct high-resolution audio', 'build a python app on the streaming analytics service', 'gosales transactions for naive bayes model', 'healthcare python streaming application demo', 'use r dataframes & ibm watson natural language understanding', 'use xgboost, scikit-learn & ibm watson machine learning apis'])
print("If this is all you see, you passed all of our tests!  Nice job!")

`4.` Now we are going to improve the consistency of the **user_user_recs** function from above.  

* Instead of arbitrarily choosing when we obtain users who are all the same closeness to a given user - choose the users that have the most total article interactions before choosing those with fewer article interactions.


* Instead of arbitrarily choosing articles from the user where the number of recommended articles starts below m and ends exceeding m, choose the articles with the most total interactions before choosing those with fewer total interactions. This ranking should be what would be obtained from the **top_articles** function you wrote earlier.
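
The tie-breaking for users described above can be sketched by sorting on two keys. The function name `top_sorted_users_sketch` and the toy matrix are assumptions. As a simplification, the number of interactions is taken from the binary matrix, so repeated views of the same article are not counted.

```python
import pandas as pd

# Made-up binary user-item matrix
user_item_toy = pd.DataFrame(
    [[1, 1, 0, 0],
     [1, 0, 1, 1],
     [1, 0, 0, 0],
     [0, 1, 1, 0]],
    index=[1, 2, 3, 4],                        # user ids
    columns=["10.0", "20.0", "30.0", "40.0"],  # article ids
)

def top_sorted_users_sketch(user_id, user_item):
    """Rank neighbors by similarity first and by number of interactions second."""
    similarity = user_item.dot(user_item.loc[user_id]).drop(user_id)
    neighbors = pd.DataFrame({
        "neighbor_id": similarity.index,
        "similarity": similarity.values,
        "num_interactions": user_item.loc[similarity.index].sum(axis=1).values,
    })
    # Highest similarity first; among equal similarity, most interactions first
    return neighbors.sort_values(
        by=["similarity", "num_interactions"], ascending=False
    ).reset_index(drop=True)

print(top_sorted_users_sketch(1, user_item_toy))
```

In this toy example all three neighbors share one article with user 1, so the order is decided by the number of interactions alone.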

In [None]:
def get_top_sorted_users(user_id, df=df, user_item=user_item):
    '''
    INPUT:
    user_id - (int)
    df - (pandas dataframe) df as defined at the top of the notebook 
    user_item - (pandas dataframe) matrix of users by articles: 
            1's when a user has interacted with an article, 0 otherwise
    
            
    OUTPUT:
    neighbors_df - (pandas dataframe) a dataframe with:
                    neighbor_id - is a neighbor user_id
                    similarity - measure of the similarity of each user to the provided user_id
                    num_interactions - the number of articles viewed by the user - if a u
                    
    Other Details - sort the neighbors_df by the similarity and then by number of interactions where 
                    highest of each is higher in the dataframe
     
    '''
    # Your code here
    
    return neighbors_df # Return the dataframe specified in the doc_string


def user_user_recs_part2(user_id, m=10):
    '''
    INPUT:
    user_id - (int) a user id
    m - (int) the number of recommendations you want for the user
    
    OUTPUT:
    recs - (list) a list of recommendations for the user by article id
    rec_names - (list) a list of recommendations for the user by article title
    
    Description:
    Loops through the users based on closeness to the input user_id
    For each user - finds articles the user hasn't seen before and provides them as recs
    Does this until m recommendations are found
    
    Notes:
    * Choose the users that have the most total article interactions 
    before choosing those with fewer article interactions.

    * Choose the articles with the most total interactions 
    before choosing those with fewer total interactions. 
   
    '''
    # Your code here
    
    return recs, rec_names

In [None]:
# Quick spot check - don't change this code - just use it to test your functions
rec_ids, rec_names = user_user_recs_part2(20, 10)
print("The top 10 recommendations for user 20 are the following article ids:")
print(rec_ids)
print()
print("The top 10 recommendations for user 20 are the following article names:")
print(rec_names)

`5.` Use your functions from above to correctly fill in the solutions to the dictionary below.  Then test your dictionary against the solution.  Provide the code you need to answer each following the comments below.

In [None]:
### Tests with a dictionary of results

user1_most_sim = # Find the user that is most similar to user 1 
user131_10th_sim = # Find the 10th most similar user to user 131

In [None]:
## Dictionary Test Here
sol_5_dict = {
    'The user that is most similar to user 1.': user1_most_sim, 
    'The user that is the 10th most similar to user 131': user131_10th_sim,
}

t.sol_5_test(sol_5_dict)

`6.` If we were given a new user, which of the above functions would you be able to use to make recommendations?  Explain.  Can you think of a better way we might make recommendations?  Use the cell below to explain a better method for new users.

**Provide your response here.**

`7.` Using your existing functions, provide the top 10 recommended articles you would provide for a new user below.  You can test your function against our thoughts to make sure we are all on the same page with how we might make a recommendation.
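
For a user without any recorded interactions, similarity to other users cannot be computed, so a popularity ranking is a natural fallback. The following sketch ranks articles by their number of interactions. The function name `most_popular_article_ids` and the toy data are assumptions for illustration.

```python
import pandas as pd

# Made-up interactions: "10.0" has three, "20.0" two and "30.0" one
interactions = pd.DataFrame({
    "user_id": [1, 2, 3, 1, 2, 1],
    "article_id": ["10.0", "10.0", "10.0", "20.0", "20.0", "30.0"],
})

def most_popular_article_ids(n, interactions):
    """Return the ids of the n articles with the most recorded interactions."""
    # value_counts sorts by frequency, most frequent first
    counts = interactions["article_id"].value_counts()
    return counts.head(n).index.tolist()

print(most_popular_article_ids(2, interactions))
```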

In [None]:
new_user = '0.0'

# What would your recommendations be for this new user '0.0'?  As a new user, they have no observed articles.
# Provide a list of the top 10 article ids you would give to 
new_user_recs = # Your recommendations here



In [None]:
assert set(new_user_recs) == set(['1314.0','1429.0','1293.0','1427.0','1162.0','1364.0','1304.0','1170.0','1431.0','1330.0']), "Oops!  It makes sense that in this case we would want to recommend the most popular articles, because we don't know anything about these users."

print("That's right!  Nice job!")

### <a class="anchor" id="Content-Recs">Part IV: Content Based Recommendations (EXTRA - NOT REQUIRED)</a>

Another method we might use to make recommendations is to perform a ranking of the highest ranked articles associated with some term.  You might consider content to be the **doc_body**, **doc_description**, or **doc_full_name**.  There isn't one way to create a content based recommendation, especially considering that each of these columns holds content related information.  

`1.` Use the function body below to create a content based recommender.  Since there isn't one right answer for this recommendation tactic, no test functions are provided.  Feel free to change the function inputs if you decide you want to try a method that requires more input values.  The input values are currently set with one idea in mind that you may use to make content based recommendations.  One additional idea is that you might want to choose the most popular recommendations that meet your 'content criteria', but again, there is a lot of flexibility in how you might make these recommendations.

### This part is NOT REQUIRED to pass this project.  However, you may choose to take this on as an extra way to show off your skills.

In [None]:
def make_content_recs():
    '''
    INPUT:
    
    OUTPUT:
    
    '''
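Since there is no single right answer here, the following is only one possible sketch of a content based recommender: it weights the words in each title with TF-IDF and ranks other articles by cosine similarity. The function name `make_content_recs_sketch`, its inputs, and the toy titles (standing in for something like **doc_full_name**) are assumptions for illustration.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase a string and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def tfidf_vectors(docs):
    """Map each document id to a dict of term -> TF-IDF weight (smoothed idf)."""
    tokens = {doc_id: tokenize(text) for doc_id, text in docs.items()}
    n_docs = len(docs)
    doc_freq = Counter(term for toks in tokens.values() for term in set(toks))
    vectors = {}
    for doc_id, toks in tokens.items():
        term_counts = Counter(toks)
        vectors[doc_id] = {
            term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in term_counts.items()
        }
    return vectors

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def make_content_recs_sketch(article_id, docs, m=2):
    """Return the m article ids whose text is most similar to article_id."""
    vectors = tfidf_vectors(docs)
    target = vectors[article_id]
    scored = [(cosine(target, vec), other)
              for other, vec in vectors.items() if other != article_id]
    # Highest similarity first; ties broken by article id for determinism
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [other for _, other in scored[:m]]

# Made-up titles, for illustration only
toy_docs = {
    '1.0': 'Deep learning with neural networks',
    '2.0': 'Neural networks for image recognition',
    '3.0': 'Visualizing data with pandas',
    '4.0': 'Cooking pasta at home',
}
print(make_content_recs_sketch('1.0', toy_docs, m=2))
```

A fuller version might remove stop words such as "with", use the longer text columns, or break ties between equally similar articles by popularity.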

`2.` Now that you have put together your content-based recommendation system, use the cell below to write a summary explaining how your content based recommender works.  Do you see any possible improvements that could be made to your function?  Is there anything novel about your content based recommender?

**Write an explanation of your content based recommendation system here.**

`3.` Use your content-recommendation system to make recommendations for the scenarios below, based on the comments.  Again, no tests are provided here, because there isn't one right answer that could be used to find these content based recommendations.

In [None]:
# make recommendations for a brand new user


# make recommendations for a user who has only interacted with article id '1427.0'



### <a class="anchor" id="Matrix-Fact">Part V: Matrix Factorization</a>

In this part of the notebook, you will use matrix factorization to make article recommendations to the users on the IBM Watson Studio platform.

`1.` You should have already created a **user_item** matrix in **question 1** of **Part III** above.  This first question only requires that you run the cells to get things set up for the rest of **Part V** of the notebook. 

In [None]:
# Load the matrix here
user_item_matrix = pd.read_pickle('user_item_matrix.p')

In [None]:
# quick look at the matrix
user_item_matrix.head()

`2.` In this situation, you can use Singular Value Decomposition from [numpy](https://docs.scipy.org/doc/numpy-1.14.0/reference/generated/numpy.linalg.svd.html) on the user-item matrix.  Use the cell below to perform SVD, and explain why this is different from the approach in the lesson.

In [None]:
# Perform SVD on the User-Item Matrix Here

u, s, vt = np.linalg.svd(user_item_matrix)  # use the built in to get the three matrices

**Provide your response here.**
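As a small illustration of the decomposition in question `2`, the sketch below runs `np.linalg.svd` on a made-up binary matrix and checks that multiplying the three factors back together recovers it. The toy matrix is an assumption; the point is that a matrix of 1's and 0's has no missing entries, which is the situation where the closed-form SVD can be applied directly.

```python
import numpy as np

# Toy binary user-item matrix: rows are users, columns are articles (made-up data)
toy_matrix = np.array([
    [1, 0, 1, 0],
    [0, 1, 0, 0],
    [1, 1, 0, 1],
], dtype=float)

# Every cell is observed (0 or 1), so there are no missing values to work around
assert not np.isnan(toy_matrix).any()

# full_matrices=False returns the compact factors: u is (3, 3), s is (3,), vt is (3, 4)
u, s, vt = np.linalg.svd(toy_matrix, full_matrices=False)
print(u.shape, s.shape, vt.shape)

# Rebuild the matrix from its factors; s holds singular values, so wrap it in np.diag
reconstructed = u @ np.diag(s) @ vt
print(np.allclose(reconstructed, toy_matrix))
```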

`3.` Now for the tricky part: how do we choose the number of latent features to use?  Running the cell below, you can see that as the number of latent features increases, we obtain a lower error rate when predicting the 1 and 0 values in the user-item matrix.  Run the cell to get an idea of how the accuracy improves as we increase the number of latent features.

In [None]:
num_latent_feats = np.arange(10,700+10,20)
sum_errs = []

for k in num_latent_feats:
    # restructure with k latent features
    s_new, u_new, vt_new = np.diag(s[:k]), u[:, :k], vt[:k, :]
    
    # take dot product
    user_item_est = np.around(np.dot(np.dot(u_new, s_new), vt_new))
    
    # compute error for each prediction to actual value
    diffs = np.subtract(user_item_matrix, user_item_est)
    
    # total errors and keep track of them
    err = np.sum(np.sum(np.abs(diffs)))
    sum_errs.append(err)
    
    
plt.plot(num_latent_feats, 1 - np.array(sum_errs)/df.shape[0]);
plt.xlabel('Number of Latent Features');
plt.ylabel('Accuracy');
plt.title('Accuracy vs. Number of Latent Features');
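The loop above can be tried on a tiny matrix to see the effect of truncating to `k` latent features. This is a sketch under stated assumptions: the random toy matrix and the helper `reconstruction_accuracy` are illustrative, and accuracy here is the fraction of matrix cells predicted correctly (errors divided by the number of cells), which is a different normalization from the cell above.

```python
import numpy as np

def reconstruction_accuracy(matrix, ks):
    """For each k, rebuild the matrix from its top-k singular triplets and
    return the fraction of cells whose rounded estimate equals the actual value."""
    u, s, vt = np.linalg.svd(matrix, full_matrices=False)
    accuracies = []
    for k in ks:
        # keep only the first k latent features
        estimate = np.around(u[:, :k] @ np.diag(s[:k]) @ vt[:k, :])
        n_wrong = np.sum(estimate != matrix)
        accuracies.append(float(1 - n_wrong / matrix.size))
    return accuracies

# Sparse random binary matrix: 30 users x 12 articles (made-up data)
rng = np.random.default_rng(0)
toy_matrix = (rng.random((30, 12)) < 0.2).astype(float)

ks = [1, 4, 8, 12]
accs = reconstruction_accuracy(toy_matrix, ks)
for k, acc in zip(ks, accs):
    print(k, round(acc, 3))
```

With all 12 features kept, the reconstruction is exact, so the accuracy on the matrix that was decomposed reaches 1. That says nothing yet about unseen data, which is the motivation for the train and test split in the next question.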

`4.` From the above, we can't really be sure how many features to use, because simply being better at predicting the 1's and 0's of the matrix doesn't tell us whether we are able to make good recommendations.  Instead, we might split our dataset into a training and test set of data, as shown in the cell below.  

Use the code from question 3 to understand the impact on accuracy of the training and test sets of data with different numbers of latent features. Using the split below: 

* How many users can we make predictions for in the test set?  
* How many users are we not able to make predictions for because of the cold start problem?
* How many articles can we make predictions for in the test set?  
* How many articles are we not able to make predictions for because of the cold start problem?

In [None]:
df_train = df.head(40000)
df_test = df.tail(5993)

def create_test_and_train_user_item(df_train, df_test):
    '''
    INPUT:
    df_train - training dataframe
    df_test - test dataframe
    
    OUTPUT:
    user_item_train - a user-item matrix of the training dataframe 
                      (unique users for each row and unique articles for each column)
    user_item_test - a user-item matrix of the testing dataframe 
                    (unique users for each row and unique articles for each column)
    test_idx - all of the test user ids
    test_arts - all of the test article ids
    
    '''
    # Your code here
    
    return user_item_train, user_item_test, test_idx, test_arts

user_item_train, user_item_test, test_idx, test_arts = create_test_and_train_user_item(df_train, df_test)
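The four questions above come down to set operations on the row and column labels of the two matrices. The sketch below shows one way this might look on toy data; the column names `user_id` and `article_id`, and the helpers `user_item_from_interactions` and `cold_start_counts`, are assumptions for illustration rather than the project's own functions.

```python
import pandas as pd

def user_item_from_interactions(interactions):
    """Build a binary user-item matrix: 1 if the user interacted with the article, else 0."""
    return (
        interactions.drop_duplicates(['user_id', 'article_id'])
        .assign(seen=1)
        .pivot(index='user_id', columns='article_id', values='seen')
        .fillna(0)
        .astype(int)
    )

def cold_start_counts(user_item_train, user_item_test):
    """Count test users/articles that do or do not also appear in the training matrix."""
    test_users = set(user_item_test.index)
    test_articles = set(user_item_test.columns)
    known_users = test_users & set(user_item_train.index)
    known_articles = test_articles & set(user_item_train.columns)
    return {
        'users_predictable': len(known_users),
        'users_cold_start': len(test_users - known_users),
        'articles_predictable': len(known_articles),
        'articles_cold_start': len(test_articles - known_articles),
    }

# Made-up interaction log, split by row position like df.head / df.tail above
toy = pd.DataFrame({
    'user_id':    [1, 1, 2, 3, 3, 2, 4, 4],
    'article_id': ['a', 'b', 'a', 'b', 'c', 'c', 'a', 'd'],
})
toy_train = user_item_from_interactions(toy.head(5))
toy_test = user_item_from_interactions(toy.tail(3))

counts = cold_start_counts(toy_train, toy_test)
print(counts)
```

In the toy split, user 4 and article `d` appear only in the test rows, so no training-based latent features exist for them.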

In [None]:
# Replace the values in the dictionary below
a = 662 
b = 574 
c = 20 
d = 0 


sol_4_dict = {
    # Replace each None with the matching letter (a, b, c, or d) from above
    'How many users can we make predictions for in the test set?': None,
    'How many users in the test set are we not able to make predictions for because of the cold start problem?': None,
    'How many articles can we make predictions for in the test set?': None,
    'How many articles in the test set are we not able to make predictions for because of the cold start problem?': None
}

t.sol_4_test(sol_4_dict)

`5.` Now use the **user_item_train** dataset from above to find U, S, and V transpose using SVD. Then find the subset of rows in the **user_item_test** dataset that you can predict using this matrix decomposition with different numbers of latent features to see how many features make sense to keep based on the accuracy on the test data. This will require combining what was done in questions `2` - `4`.

Use the cells below to explore how well SVD works towards making predictions for recommendations on the test data.  

In [None]:
# fit SVD on the user_item_train matrix
u_train, s_train, vt_train = np.linalg.svd(user_item_train)  # fit svd similar to above then use the cells below

In [None]:
# Use these cells to see how well you can use the training 
# decomposition to predict on test data
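One way to approach this step, sketched under assumptions: decompose the training matrix, keep only the rows of U and the columns of V transpose that belong to users and articles also present in the test matrix, and compare the rounded estimates with the test values. The helper name `holdout_accuracy_by_k` and the toy matrices are hypothetical.

```python
import numpy as np
import pandas as pd

def holdout_accuracy_by_k(user_item_train, user_item_test, ks):
    """Accuracy on the test cells that can be predicted, for each number of latent features."""
    u, s, vt = np.linalg.svd(user_item_train.to_numpy(dtype=float), full_matrices=False)

    # Only users and articles seen in training have latent features
    common_users = user_item_train.index.intersection(user_item_test.index)
    common_articles = user_item_train.columns.intersection(user_item_test.columns)
    row_pos = user_item_train.index.get_indexer(common_users)
    col_pos = user_item_train.columns.get_indexer(common_articles)

    actual = user_item_test.loc[common_users, common_articles].to_numpy()

    results = {}
    for k in ks:
        # rows of U for the shared users, columns of V transpose for the shared articles
        estimate = np.around(u[row_pos, :k] @ np.diag(s[:k]) @ vt[:k, :][:, col_pos])
        results[k] = float(np.mean(estimate == actual))
    return results

# Made-up matrices: user 2 and articles 'a', 'c' appear in both train and test
toy_train = pd.DataFrame(
    [[1, 0, 1], [0, 1, 0], [1, 1, 0], [0, 0, 1]],
    index=[1, 2, 3, 4], columns=['a', 'b', 'c'],
)
toy_test = pd.DataFrame(
    [[0, 1, 1], [1, 0, 0]],
    index=[2, 9], columns=['a', 'c', 'z'],
)

print(holdout_accuracy_by_k(toy_train, toy_test, ks=[1, 2, 3]))
```

With all three latent features the estimate simply reproduces user 2's training row, so it gets article `a` right and misses the new interaction with `c`. Note that test accuracy does not have to rise with `k`, unlike accuracy on the matrix that was decomposed.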

`6.` Use the cell below to comment on the results you found in the previous question. Given the circumstances of your results, discuss what you might do to determine whether the recommendations you make with any of the above recommendation systems are an improvement over how users currently find articles.

**Your response here.**

<a id='conclusions'></a>
### Extras
Using your workbook, you could now save your recommendations for each user, develop a class to make new predictions and update your results, and make a flask app to deploy your results.  These tasks are beyond what is required for this project.  However, from what you learned in the lessons, you are certainly capable of taking these tasks on to improve upon your work here!


## Conclusion

> Congratulations!  You have reached the end of the Recommendations with IBM project! 

> **Tip**: Once you are satisfied with your work here, check over your report to make sure that it satisfies all the areas of the [rubric](https://review.udacity.com/#!/rubrics/2322/view). You should also probably remove all of the "Tips" like this one so that the presentation is as polished as possible.


## Directions to Submit

> Before you submit your project, you need to create a .html or .pdf version of this notebook in the workspace here. To do that, run the code cell below. If it worked correctly, you should get a return code of 0, and you should see the generated .html file in the workspace directory (click on the orange Jupyter icon in the upper left).

> Alternatively, you can download this report as .html via the **File** > **Download as** submenu, and then manually upload it into the workspace directory by clicking on the orange Jupyter icon in the upper left, then using the Upload button.

> Once you've done this, you can submit your project by clicking on the "Submit Project" button in the lower right here. This will create and submit a zip file with this .ipynb doc and the .html or .pdf version you created. Congratulations! 

In [None]:
from subprocess import call
# Newer nbconvert releases require an explicit output format, so request HTML
call(['python', '-m', 'nbconvert', '--to', 'html', 'Recommendations_with_IBM.ipynb'])