# IBM Data Analyst Capstone Project
by Alex Paquette

This project was completed as part of IBM's Data Analyst Professional Certificate. It demonstrates my ability to work through the entire data analysis process, which consists of the following steps:

1. Data Collection
2. Data Wrangling
3. Exploratory Data Analysis
4. Data Visualization



Install required libraries using `pip`

In [277]:
%pip install pandas


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m25.2[0m[39;49m -> [0m[32;49m26.0.1[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m
Note: you may need to restart the kernel to use updated packages.


Import required libraries

In [278]:
import pandas as pd

# Data Collection

There are a variety of methods we can use to collect data: APIs, web scraping, and reading files (Excel, CSV, JSON, etc.). For this project, we will use a CSV file provided by Coursera. It can be obtained from the following URL:

https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/VYPrOu0Vs3I0hKLLjiPGrA/survey-data-with-duplicate.csv

We will be using this dataset to go through the Data Analysis process, starting with data collection.
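As a rough illustration of the file-reading route, the sketch below parses the same two records from CSV text and from JSON text held in memory. The sample records and variable names are made up for illustration and are not taken from the survey data.

```python
import io

import pandas as pd

# Hypothetical sample records (not from the survey dataset).
csv_text = "ResponseId,Country\n1,Canada\n2,Germany\n"
json_text = '[{"ResponseId": 1, "Country": "Canada"}, {"ResponseId": 2, "Country": "Germany"}]'

# read_csv and read_json both accept paths, URLs, or file-like objects.
from_csv = pd.read_csv(io.StringIO(csv_text))
from_json = pd.read_json(io.StringIO(json_text))

print(from_csv.shape, from_json.shape)
```

Either way we end up with a `DataFrame`, so the rest of the workflow does not depend on the original file format.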

## About the Dataset

Before we go any further, it's important to understand what attributes we're working with. This will help us during the Data Wrangling stage and will inform how we handle duplicates and missing values.

The dataset consists of responses from the **Stack Overflow Developer Survey**.

### Column Description for Survey Data
| Column name | Question text |
| --- | --- |
| ResponseId | Randomized respondent ID number|
| MainBranch | Which of the following options best describes you today?|
| Age | What is your age?|
| Employment | What is your current employment status? |
| RemoteWork | How often do you work remotely? |
| Check | Various verification or check questions related to survey consistency |
| CodingActivities | What coding activities do you engage in? |
| EdLevel | What is the highest level of formal education you have completed? |
| LearnCode | How did you learn to code? |
| LearnCodeOnline | Have you used online resources to learn coding? |
| TechDoc | How do you use technical documentation? |
| YearsCode | How many years have you been coding? |
| DevType | What is your role or type of development you do? |
| OrgSize | What is the size of the organization you work for? |
| PurchaseInfluence | How much influence do you have on purchasing technology at your company? |
| BuyNewTool | How does your company decide whether to buy new tools or technology? |
| BuildvsBuy | Does your company prefer to build or buy software? |
| TechEndorse | Do you endorse any specific technologies at your company? |
| Country | In which country do you reside? |
| Currency | Which currency do you use day-to-day? |
| CompTotal | What is your current total compensation? |
| LanguageHaveWorkedWith | Which programming languages have you worked with in the past year? |
| LanguageWantToWorkWith | Which programming languages do you want to work with in the future? |
| LanguageAdmired | Which programming languages do you admire most? |
| DatabaseHaveWorkedWith | Which database technologies have you worked with in the past year? |
| DatabaseWantToWorkWith | Which database technologies do you want to work with in the future? |
| DatabaseAdmired | Which database technologies do you admire most? |
| PlatformHaveWorkedWith | Which platforms have you worked with in the past year? |
| PlatformWantToWorkWith | Which platforms do you want to work with in the future? |
| PlatformAdmired | Which platforms do you admire most? |
| WebframeHaveWorkedWith | Which web frameworks have you worked with in the past year? |
| WebframeWantToWorkWith | Which web frameworks do you want to work with in the future? |
| WebframeAdmired | Which web frameworks do you admire most? |
| EmbeddedHaveWorkedWith | Which embedded systems have you worked with in the past year? |
| EmbeddedWantToWorkWith | Which embedded systems do you want to work with in the future? |
| EmbeddedAdmired | Which embedded systems do you admire most? |
| MiscTechHaveWorkedWith | Which miscellaneous technologies have you worked with in the past year? |
| MiscTechWantToWorkWith | Which miscellaneous technologies do you want to work with in the future? |
| MiscTechAdmired | Which miscellaneous technologies do you admire most? |
| SOVisitFreq | How frequently do you visit Stack Overflow? |
| SOPartFreq | How often do you participate in Q&A on Stack Overflow? |
| AISelect | How do you feel about artificial intelligence tools for development? |
| AIBen | What benefits have you experienced from using AI tools? |
| AIChallenges | What challenges have you faced while using AI tools? |
| JobSat | How satisfied are you with your current job? |

This is quite a comprehensive list of questions, and far too many for us to work with in the scope of this project. We will be selecting which attributes we want to keep for our analysis and which we want to drop.

We will use the `pandas.read_csv()` function to read the provided CSV file and store it in a `DataFrame`.

In [279]:
file_path = 'https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/VYPrOu0Vs3I0hKLLjiPGrA/survey-data-with-duplicate.csv'
df = pd.read_csv(file_path) #read csv into a dataframe

We can use the `head()` method to see the first five records of this dataset. This gives us confidence that the data was loaded properly.

In [280]:
df.head()

Unnamed: 0,ResponseId,MainBranch,Age,Employment,RemoteWork,Check,CodingActivities,EdLevel,LearnCode,LearnCodeOnline,...,JobSatPoints_6,JobSatPoints_7,JobSatPoints_8,JobSatPoints_9,JobSatPoints_10,JobSatPoints_11,SurveyLength,SurveyEase,ConvertedCompYearly,JobSat
0,1,I am a developer by profession,Under 18 years old,"Employed, full-time",Remote,Apples,Hobby,Primary/elementary school,Books / Physical media,,...,,,,,,,,,,
1,2,I am a developer by profession,35-44 years old,"Employed, full-time",Remote,Apples,Hobby;Contribute to open-source projects;Other...,"Bachelor’s degree (B.A., B.S., B.Eng., etc.)",Books / Physical media;Colleague;On the job tr...,Technical documentation;Blogs;Books;Written Tu...,...,0.0,0.0,0.0,0.0,0.0,0.0,,,,
2,3,I am a developer by profession,45-54 years old,"Employed, full-time",Remote,Apples,Hobby;Contribute to open-source projects;Other...,"Master’s degree (M.A., M.S., M.Eng., MBA, etc.)",Books / Physical media;Colleague;On the job tr...,Technical documentation;Blogs;Books;Written Tu...,...,,,,,,,Appropriate in length,Easy,,
3,4,I am learning to code,18-24 years old,"Student, full-time",,Apples,,Some college/university study without earning ...,"Other online resources (e.g., videos, blogs, f...",Stack Overflow;How-to videos;Interactive tutorial,...,,,,,,,Too long,Easy,,
4,5,I am a developer by profession,18-24 years old,"Student, full-time",,Apples,,"Secondary school (e.g. American high school, G...","Other online resources (e.g., videos, blogs, f...",Technical documentation;Blogs;Written Tutorial...,...,,,,,,,Too short,Easy,,
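Beyond `head()`, the overall shape and column types are a quick sanity check after loading. This is a minimal sketch on a made-up stand-in frame; `sample` and its values are hypothetical.

```python
import pandas as pd

# Hypothetical stand-in for the survey DataFrame.
sample = pd.DataFrame({
    "ResponseId": [1, 2, 3],
    "Age": ["Under 18 years old", "35-44 years old", "45-54 years old"],
})

n_rows, n_cols = sample.shape  # (number of rows, number of columns)
print(f"{n_rows} rows x {n_cols} columns")
print(sample.dtypes)           # one dtype per column
```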


# Data Wrangling

This step involves cleaning and preparing the dataset to make it ready for our analysis. We will perform the following tasks:

1. Identify and remove duplicate rows.
2. Drop unnecessary columns
3. Standardize our data
4. Analyze missing values
5. Normalize data for comparative analysis

## Identify and remove duplicate rows

First, let's identify how many duplicates we have in our dataset.

We'll narrow our search to `ResponseId`. Each survey response should have exactly one ID, which makes this the natural attribute for spotting duplicates.

In [281]:
# Let's identify how many duplicates we have in our dataset
dupe_count = df['ResponseId'].duplicated().sum()
print(f"Number of duplicates: {dupe_count}")

Number of duplicates: 20


This tells us there are 20 duplicates in our dataset. Next, we remove them, keeping the first instance of each `ResponseId` and dropping the rest by setting `keep` to `'first'`.

In [282]:
df.drop_duplicates(subset=['ResponseId'], keep='first', inplace=True)

We can count our duplicates again to be sure they've been removed.

In [283]:
dupe_count = df['ResponseId'].duplicated().sum()
print(f"Number of duplicates: {dupe_count}")

Number of duplicates: 0


And sure enough, we see there are no duplicates!
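The same check-and-remove pattern can be tried on a tiny frame to see exactly what `duplicated()` and `drop_duplicates()` do. The data below is hypothetical, and the sketch returns a new DataFrame rather than using `inplace=True`, which is a stylistic choice and not what the notebook does.

```python
import pandas as pd

# Hypothetical responses: ResponseId 2 appears twice.
responses = pd.DataFrame({
    "ResponseId": [1, 2, 2, 3],
    "Age": ["18-24 years old", "25-34 years old", "25-34 years old", "35-44 years old"],
})

# duplicated() flags every repeat after the first occurrence.
dupe_mask = responses["ResponseId"].duplicated()
print(dupe_mask.sum())  # 1

# Keep the first row for each ResponseId and leave the original frame untouched.
deduped = responses.drop_duplicates(subset=["ResponseId"], keep="first")
print(len(deduped))  # 3
```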

## Drop unnecessary columns

Before we move any further, it's a good idea to properly go through each attribute and determine whether we want to keep it for analytical reasons, or drop it to simplify our dataset.

Looking at our survey questions from earlier, there are a few that stand out as being unnecessary for our purposes:

### MainBranch
This attribute has the following response options:
- I am a developer by profession
- I am learning to code
- I code primarily as a hobby
- I am not primarily a developer, but I write code sometimes...
- I used to be a developer by profession, but no longer am

The information provided by this attribute can be gleaned from others, such as `Employment`, `EdLevel`, and `YearsCode`, to name a few.

Here are a few other attributes we'll be dropping as they're not useful to our analysis:
- Check
- CodingActivities
- LearnCode
- LearnCodeOnline
- TechDoc
- TechEndorse
- MiscTechHaveWorkedWith
- MiscTechWantToWorkWith
- MiscTechAdmired
- SOVisitFreq
- SOPartFreq
- AISelect
- AIBen

To drop columns, we use the `.drop()` method on our dataframe and set `axis=1` to specify that we're dropping columns. `axis=0` is for dropping rows.

In [284]:
drop_columns = ['MainBranch', 'Check', 'CodingActivities', 'LearnCode', 'LearnCodeOnline', 
                'TechDoc', 'TechEndorse', 'MiscTechHaveWorkedWith', 'MiscTechWantToWorkWith', 
                'MiscTechAdmired', 'SOVisitFreq', 'SOPartFreq', 'AISelect', 'AIBen']
df.drop(drop_columns, axis=1, inplace=True)
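An equivalent spelling uses the `columns=` keyword instead of `axis=1`. The sketch below runs on a hypothetical one-row frame and also shows `errors="ignore"`, which skips labels that are not present (handy when a cell is re-run after the columns are already gone).

```python
import pandas as pd

# Hypothetical frame with one column we want to discard.
sample = pd.DataFrame({
    "ResponseId": [1],
    "Check": ["Apples"],
    "Age": ["18-24 years old"],
})

# columns= names the axis explicitly; errors="ignore" skips absent labels.
trimmed = sample.drop(columns=["Check", "NotAColumn"], errors="ignore")
print(list(trimmed.columns))  # ['ResponseId', 'Age']
```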

Listing the remaining columns shows many that are not described in the table above. We should drop most of these, since we don't have any information about what they represent. Some can be inferred and are useful, such as `ConvertedCompYearly`, which appears to be yearly compensation converted to a common currency.

In [285]:
df.columns

Index(['ResponseId', 'Age', 'Employment', 'RemoteWork', 'EdLevel', 'YearsCode',
       'YearsCodePro', 'DevType', 'OrgSize', 'PurchaseInfluence', 'BuyNewTool',
       'BuildvsBuy', 'Country', 'Currency', 'CompTotal',
       'LanguageHaveWorkedWith', 'LanguageWantToWorkWith', 'LanguageAdmired',
       'DatabaseHaveWorkedWith', 'DatabaseWantToWorkWith', 'DatabaseAdmired',
       'PlatformHaveWorkedWith', 'PlatformWantToWorkWith', 'PlatformAdmired',
       'WebframeHaveWorkedWith', 'WebframeWantToWorkWith', 'WebframeAdmired',
       'EmbeddedHaveWorkedWith', 'EmbeddedWantToWorkWith', 'EmbeddedAdmired',
       'ToolsTechHaveWorkedWith', 'ToolsTechWantToWorkWith',
       'ToolsTechAdmired', 'NEWCollabToolsHaveWorkedWith',
       'NEWCollabToolsWantToWorkWith', 'NEWCollabToolsAdmired',
       'OpSysPersonal use', 'OpSysProfessional use',
       'OfficeStackAsyncHaveWorkedWith', 'OfficeStackAsyncWantToWorkWith',
       'OfficeStackAsyncAdmired', 'OfficeStackSyncHaveWorkedWith',
       'Office

In [286]:
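# Drop every column from ToolsTechHaveWorkedWith through SurveyEase
# (label-based slices with .loc include both endpoints)
additional_drop_cols = df.loc[:, 'ToolsTechHaveWorkedWith':'SurveyEase'].columns
df.drop(additional_drop_cols, axis=1, inplace=True)
# YearsCodePro sits outside that range, so drop it separately
df.drop('YearsCodePro', axis=1, inplace=True)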
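Label-based slicing with `.loc` includes both endpoints, which is what makes the range drop above work. Here is a small sketch with hypothetical column names `A` to `C` standing in for the block of unknown columns.

```python
import pandas as pd

# Hypothetical empty frame; only the column labels matter here.
sample = pd.DataFrame(columns=["ResponseId", "A", "B", "C", "JobSat"])

# Unlike positional slicing, a label slice includes the end label.
span = sample.loc[:, "A":"C"].columns
print(list(span))  # ['A', 'B', 'C']

trimmed = sample.drop(columns=span)
print(list(trimmed.columns))  # ['ResponseId', 'JobSat']
```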

In [287]:
df.columns

Index(['ResponseId', 'Age', 'Employment', 'RemoteWork', 'EdLevel', 'YearsCode',
       'DevType', 'OrgSize', 'PurchaseInfluence', 'BuyNewTool', 'BuildvsBuy',
       'Country', 'Currency', 'CompTotal', 'LanguageHaveWorkedWith',
       'LanguageWantToWorkWith', 'LanguageAdmired', 'DatabaseHaveWorkedWith',
       'DatabaseWantToWorkWith', 'DatabaseAdmired', 'PlatformHaveWorkedWith',
       'PlatformWantToWorkWith', 'PlatformAdmired', 'WebframeHaveWorkedWith',
       'WebframeWantToWorkWith', 'WebframeAdmired', 'EmbeddedHaveWorkedWith',
       'EmbeddedWantToWorkWith', 'EmbeddedAdmired', 'ConvertedCompYearly',
       'JobSat'],
      dtype='str')

Now we have a more manageable set of attributes to work with.

## Standardize our Data

Before we handle missing values, we want to go through our categorical attributes and standardize them where necessary. First, we need to identify which columns are categorical.


In [288]:
# str data types are categorical
categorical = [var for var in df.columns if df[var].dtype=='str']
categorical

['Age',
 'Employment',
 'RemoteWork',
 'EdLevel',
 'YearsCode',
 'DevType',
 'OrgSize',
 'PurchaseInfluence',
 'BuyNewTool',
 'BuildvsBuy',
 'Country',
 'Currency',
 'LanguageHaveWorkedWith',
 'LanguageWantToWorkWith',
 'LanguageAdmired',
 'DatabaseHaveWorkedWith',
 'DatabaseWantToWorkWith',
 'DatabaseAdmired',
 'PlatformHaveWorkedWith',
 'PlatformWantToWorkWith',
 'PlatformAdmired',
 'WebframeHaveWorkedWith',
 'WebframeWantToWorkWith',
 'WebframeAdmired',
 'EmbeddedHaveWorkedWith',
 'EmbeddedWantToWorkWith',
 'EmbeddedAdmired']
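Comparing against `'str'` depends on how the installed pandas version labels text columns. One alternative, sketched here on hypothetical data, is to select everything that is not numeric with `select_dtypes`; this is an assumption about what "categorical" should mean for this dataset, not the notebook's own method.

```python
import pandas as pd

# Hypothetical frame with one integer, one text, and one float column.
sample = pd.DataFrame({
    "ResponseId": [1, 2],
    "Age": ["18-24 years old", "25-34 years old"],
    "CompTotal": [50000.0, None],
})

# Everything that is not a numeric dtype is treated as categorical here.
non_numeric = sample.select_dtypes(exclude="number").columns.tolist()
print(non_numeric)  # ['Age']
```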

In [289]:
df['DevType'].unique()

<StringArray>
[                                            nan,
                         'Developer, full-stack',
                          'Developer Experience',
                                       'Student',
                           'Academic researcher',
                               'Project manager',
                            'Developer Advocate',
                           'Developer, back-end',
                       'Other (please specify):',
                          'Developer, front-end',
                        'Database administrator',
 'Developer, desktop or enterprise applications',
                 'Cloud infrastructure engineer',
 'Data scientist or machine learning specialist',
                   'Research & Development role',
   'Developer, embedded applications or devices',
                          'System administrator',
                             'DevOps specialist',
                           'Engineering manager',
                                    

Next we want to go through each one and identify the unique values. For many, no changes will be necessary, but for some we'll want to standardize the values to either be easier to read or even condense some categories into a single one. For the purpose of this notebook I will only show the attributes that needed standardizing.

### Employment

The `Employment` attribute has 110 unique entries. However, this is deceiving: respondents were allowed to select multiple options, so there are fewer employment categories than the `unique()` method returns. We need to break this down further by splitting each response string and collecting the unique options again.

In [290]:
# Avoid naming the loop variable `type`, which would shadow the built-in
for employment in df['Employment'].unique():
    print(employment)

Employed, full-time
Student, full-time
Student, full-time;Not employed, but looking for work
Independent contractor, freelancer, or self-employed
Not employed, and not looking for work
Employed, full-time;Student, part-time
Employed, full-time;Independent contractor, freelancer, or self-employed
Employed, full-time;Student, full-time
Employed, part-time
Student, full-time;Employed, part-time
Student, part-time;Employed, part-time
I prefer not to say
Not employed, but looking for work
Student, part-time
Employed, full-time;Student, full-time;Independent contractor, freelancer, or self-employed;Employed, part-time
Employed, full-time;Independent contractor, freelancer, or self-employed;Student, part-time
Independent contractor, freelancer, or self-employed;Employed, part-time
Independent contractor, freelancer, or self-employed;Student, part-time;Employed, part-time
Student, full-time;Not employed, but looking for work;Independent contractor, freelancer, or self-employed
Student, full-ti

Once these combined responses are split into individual selections, we can see that they can be simplified. There are multiple types of employment listed, but we only really care about these for our purposes:
- Employed
- Student
- Not employed
- Independent contractor
- I prefer not to say
- Retired

In [291]:
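# Split multi-select responses on ';' and give each selection its own row
# (Series.explode() takes no column name; that argument belongs to DataFrame.explode)
employment_categories = df['Employment'].str.split(';').explode().unique()
employment_categories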

<StringArray>
[                                 'Employed, full-time',
                                   'Student, full-time',
                   'Not employed, but looking for work',
 'Independent contractor, freelancer, or self-employed',
               'Not employed, and not looking for work',
                                   'Student, part-time',
                                  'Employed, part-time',
                                  'I prefer not to say',
                                              'Retired']
Length: 9, dtype: str
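The split-then-explode step can also feed `value_counts()` to see how often each individual option was selected. A minimal sketch on three made-up responses:

```python
import pandas as pd

# Hypothetical multi-select responses separated by ';'.
employment = pd.Series([
    "Employed, full-time",
    "Student, full-time;Employed, part-time",
    "Employed, full-time;Student, part-time",
])

# One row per individual selection.
selections = employment.str.split(";").explode()
counts = selections.value_counts()

print(counts["Employed, full-time"])  # 2
print(selections.nunique())           # 4
```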

At first glance this seems simple: we can strip everything after the comma in each response. However, it's more complicated than that. Because respondents were allowed to select multiple options, we need a way to prioritize some responses over others.

Here is the order of priority I have selected:
1. Retired
2. Student
3. Employed
4. Independent contractor
5. Not employed
6. I prefer not to say

For each entry we need to do the following:
- Split each employment type using the semicolon
- Only pull the value before the comma
- Select an entry based on order of priority.

In [292]:
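# Highest-priority category first
PRIORITY = ['Retired', 'Student', 'Employed', 'Independent contractor', 'Not employed', 'I prefer not to say']

def simplify_employment(response: str) -> str | None:
    """Collapse a ';'-separated Employment response into a single category."""
    if pd.isna(response):
        return None

    # Keep only the text before the first comma,
    # e.g. 'Employed, full-time' -> 'Employed'
    selections = {selection.split(',')[0].strip() for selection in response.split(';')}

    # Return the first category, in priority order, that was selected
    for category in PRIORITY:
        if category in selections:
            return category.title()

    return None  # no recognized category in the response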

In [293]:
df['Employment'] = df['Employment'].apply(simplify_employment)

In [294]:
df['Employment'].unique()

<StringArray>
[              'Employed',                'Student', 'Independent Contractor',
           'Not Employed',    'I Prefer Not To Say',                'Retired']
Length: 6, dtype: str

And just like that, we've simplified our Employment category!
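To make the priority rule concrete, here is a standalone sketch with two worked inputs. `pick_by_priority` is a hypothetical helper written for this illustration; unlike the function above, it does not title-case the result.

```python
# Highest-priority category first (same order as the list above).
PRIORITY = ["Retired", "Student", "Employed", "Independent contractor",
            "Not employed", "I prefer not to say"]


def pick_by_priority(response):
    """Return the highest-priority base category in a ';'-separated response."""
    # Base category = text before the first comma of each selection.
    bases = {part.split(",")[0].strip() for part in response.split(";")}
    return next((category for category in PRIORITY if category in bases), None)


print(pick_by_priority("Employed, full-time;Student, part-time"))  # Student
print(pick_by_priority("Not employed, but looking for work"))      # Not employed
```

In the first input both `Employed` and `Student` are present, and `Student` wins because it appears earlier in the priority list.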

### RemoteWork

For this attribute, we just want to simplify the `Hybrid (some remote, some in-person)` entry to `Hybrid` for readability.

In [295]:
df['RemoteWork'].unique()

<StringArray>
['Remote', nan, 'In-person', 'Hybrid (some remote, some in-person)']
Length: 4, dtype: str

In [296]:
df['RemoteWork'] = df['RemoteWork'].replace('Hybrid (some remote, some in-person)', 'Hybrid')

In [297]:
df['RemoteWork'].unique()

<StringArray>
['Remote', nan, 'In-person', 'Hybrid']
Length: 4, dtype: str

### EdLevel
For this attribute, we want to simplify the entries for readability, similar to what we did for `Hybrid` with RemoteWork.

In [298]:
df['EdLevel'].unique()

<StringArray>
[                                                         'Primary/elementary school',
                                       'Bachelor’s degree (B.A., B.S., B.Eng., etc.)',
                                    'Master’s degree (M.A., M.S., M.Eng., MBA, etc.)',
                             'Some college/university study without earning a degree',
 'Secondary school (e.g. American high school, German Realschule or Gymnasium, etc.)',
                                     'Professional degree (JD, MD, Ph.D, Ed.D, etc.)',
                                                'Associate degree (A.A., A.S., etc.)',
                                                                     'Something else',
                                                                                  nan]
Length: 9, dtype: str

In [None]:
# Note: Series.map() turns any value that is missing from the dictionary into NaN,
# so 'Something else' is treated as a missing value after this step.
edu_map = {
    'Primary/elementary school': 'Elementary',
    'Bachelor’s degree (B.A., B.S., B.Eng., etc.)': "Bachelor's",
    'Master’s degree (M.A., M.S., M.Eng., MBA, etc.)': "Master's",
    'Some college/university study without earning a degree': 'Some post-secondary',
    'Secondary school (e.g. American high school, German Realschule or Gymnasium, etc.)': 'Secondary',
    'Professional degree (JD, MD, Ph.D, Ed.D, etc.)': 'Professional degree',
    'Associate degree (A.A., A.S., etc.)': 'Associate degree'
}
df['EdLevel'] = df['EdLevel'].map(edu_map)

In [300]:
df['EdLevel'].unique()

<StringArray>
[         'Elementary',          'Bachelor's',            'Master's',
 'Some post-secondary',           'Secondary', 'Professional degree',
    'Associate degree',                   nan]
Length: 8, dtype: str
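Notice that `Something else` no longer appears in the output: `map()` converts any value without a dictionary entry into a missing value. The sketch below, on hypothetical data, shows that behavior and one way to keep unmapped labels if that is preferred.

```python
import pandas as pd

# Hypothetical education levels, including one label that is not in the mapping.
levels = pd.Series(["Primary/elementary school", "Something else", None])
mapping = {"Primary/elementary school": "Elementary"}

mapped = levels.map(mapping)   # unmapped 'Something else' becomes missing
kept = mapped.fillna(levels)   # fall back to the original label where unmapped

print(mapped.isna().sum())     # 2
print(kept.tolist()[:2])       # ['Elementary', 'Something else']
```

Whether `Something else` should count as missing is an analysis decision; the point here is only that `map()` makes that choice silently.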

### YearsCode
For this attribute, we actually want to convert it to integer values, since all but two of these values are the actual number of years. First, we need to convert `Less than 1 year` and `More than 50 years` to a number value. We'll assign `1` to the former and `50` to the latter.

In [301]:
df['YearsCode'].unique()

<StringArray>
[                 nan,                 '20',                 '37',
                  '4',                  '9',                 '10',
                  '7',                  '1',                 '15',
                 '30',                 '31',                  '6',
                 '12',                 '22',                  '5',
                 '36',                 '25',                 '44',
                 '24',                 '18',                  '3',
                  '8', 'More than 50 years',                 '11',
                 '29',                 '40',                 '39',
                  '2',                 '42',                 '34',
                 '19',                 '35',                 '16',
                 '33',                 '13',                 '23',
                 '14',                 '28',                 '17',
                 '21',                 '43',                 '46',
                 '26',                 '32',    

In [302]:
# Replace both text labels in a single pass
df['YearsCode'] = df['YearsCode'].replace({'More than 50 years': '50', 'Less than 1 year': '1'})

In [303]:
df['YearsCode'].unique()

<StringArray>
[ nan, '20', '37',  '4',  '9', '10',  '7',  '1', '15', '30', '31',  '6', '12',
 '22',  '5', '36', '25', '44', '24', '18',  '3',  '8', '50', '11', '29', '40',
 '39',  '2', '42', '34', '19', '35', '16', '33', '13', '23', '14', '28', '17',
 '21', '43', '46', '26', '32', '41', '45', '27', '38', '48', '47', '49']
Length: 51, dtype: str

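Now that all values except the missing ones are numeric strings, we can convert the column to integers. `to_numeric()` with `errors='coerce'` turns anything it cannot parse into a missing value, and because a plain integer column cannot hold missing values, we cast the result to pandas' nullable `Int64` dtype.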

In [304]:
df['YearsCode'] = pd.to_numeric(df['YearsCode'], errors='coerce').astype('Int64')

We'll handle the missing values in the next section, now that this column is numeric.
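The whole conversion can be checked end to end on a few made-up values. This is a sketch with a hypothetical `years` series, following the same replace, coerce, and cast steps.

```python
import pandas as pd

# Hypothetical YearsCode values, including both text labels and a missing entry.
years = pd.Series(["20", "Less than 1 year", None, "More than 50 years"])

# Swap the two text labels for numeric strings.
years = years.replace({"Less than 1 year": "1", "More than 50 years": "50"})

# Parse to numbers, then cast to the nullable integer dtype so the gap survives.
years_int = pd.to_numeric(years, errors="coerce").astype("Int64")

print(years_int.dtype)
print(years_int.isna().sum())
```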

### DevType
This column has too many categories to analyze comfortably. Many of them can be folded together to bring the number down to a manageable level.

In [305]:
df['DevType'].unique()

<StringArray>
[                                            nan,
                         'Developer, full-stack',
                          'Developer Experience',
                                       'Student',
                           'Academic researcher',
                               'Project manager',
                            'Developer Advocate',
                           'Developer, back-end',
                       'Other (please specify):',
                          'Developer, front-end',
                        'Database administrator',
 'Developer, desktop or enterprise applications',
                 'Cloud infrastructure engineer',
 'Data scientist or machine learning specialist',
                   'Research & Development role',
   'Developer, embedded applications or devices',
                          'System administrator',
                             'DevOps specialist',
                           'Engineering manager',
                                    

Having scrolled through, I believe we can condense these categories into the following:
- Developer
- Student
- Researcher
- Manager
- Database administrator
- Engineer
- Data scientist
- System administrator
- DevOps
- Designer
- Security professional
- Data analyst
- Educator
- Scientist
- Blockchain
- Marketing

All the Developer types have been folded into a single type. Likewise Engineers were combined into a single type. 

In [None]:
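# Draft, partial mapping: covers only DevType values visible in the truncated output above.
# The remaining values still need to be assigned to the categories listed earlier.
devType_map = {
    'Developer, full-stack': 'Developer',
    'Developer, back-end': 'Developer',
    'Developer, front-end': 'Developer',
    'Developer, desktop or enterprise applications': 'Developer',
    'Developer, embedded applications or devices': 'Developer',
    'Student': 'Student',
    'Academic researcher': 'Researcher',
    'Project manager': 'Manager',
    'Engineering manager': 'Manager',
    'Database administrator': 'Database administrator',
    'Cloud infrastructure engineer': 'Engineer',
    'Data scientist or machine learning specialist': 'Data scientist',
    'System administrator': 'System administrator',
    'DevOps specialist': 'DevOps',
}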
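Before applying a mapping like this, it helps to check which values it does not cover yet, since `map()` would silently turn them into missing values. A minimal sketch, using a hypothetical series and a deliberately incomplete `draft_map`:

```python
import pandas as pd

# Hypothetical DevType values and a deliberately incomplete mapping.
dev_types = pd.Series(["Developer, full-stack", "Student", None, "Developer, back-end"])
draft_map = {"Developer, full-stack": "Developer", "Student": "Student"}

# Values present in the data that the mapping does not handle yet.
unmapped = sorted(set(dev_types.dropna().unique()) - set(draft_map))
print(unmapped)  # ['Developer, back-end']
```

An empty list means every observed value has a target category.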

## Analyze Missing Values

Now that we've dropped the columns we don't plan to use, we want to identify missing values in our dataset. We can do this by calling `isna()` (also available under the alias `isnull()`) on our dataframe and counting the `True` values in each column. This tells us which columns need to have missing values handled.

In [102]:
# Count missing values per column and report only the columns that have any.
# (Indexing value_counts() by position is unreliable: it sorts by frequency,
# so position 1 is not always the count of missing values.)
missing_counts = df.isna().sum()
for column, count_na in missing_counts[missing_counts > 0].items():
    print(f"{column}: {count_na}")

RemoteWork: 10631
EdLevel: 4653
YearsCode: 5568
YearsCodePro: 13827
DevType: 5992
OrgSize: 17957
PurchaseInfluence: 18031
BuyNewTool: 20256
BuildvsBuy: 22079
Country: 6507
Currency: 18753
CompTotal: 31697
LanguageHaveWorkedWith: 5692
LanguageWantToWorkWith: 9685
LanguageAdmired: 14565
DatabaseHaveWorkedWith: 15183
DatabaseWantToWorkWith: 22879
DatabaseAdmired: 26880
PlatformHaveWorkedWith: 23071
PlatformWantToWorkWith: 30905
PlatformAdmired: 31377
WebframeHaveWorkedWith: 20276
WebframeWantToWorkWith: 26902
WebframeAdmired: 30494
EmbeddedHaveWorkedWith: 22214
EmbeddedWantToWorkWith: 17600
EmbeddedAdmired: 16733
ToolsTechHaveWorkedWith: 12955
ToolsTechWantToWorkWith: 19353
ToolsTechAdmired: 21440
NEWCollabToolsHaveWorkedWith: 7845
NEWCollabToolsWantToWorkWith: 13350
NEWCollabToolsAdmired: 14726
OpSysPersonal use: 7263
OpSysProfessional use: 12464
OfficeStackAsyncHaveWorkedWith: 17344
OfficeStackAsyncWantToWorkWith: 26471
OfficeStackAsyncAdmired: 28233
OfficeStackSyncHaveWorkedWith: 9892


That's quite a lot of missing values! We'll need to look at each column and decide how to handle its missing values.
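Raw counts are easier to judge next to the share of rows they represent. The sketch below builds such a summary on a hypothetical four-row frame; `isna().mean()` gives the fraction of missing values per column.

```python
import pandas as pd

# Hypothetical frame with gaps in two of its three columns.
sample = pd.DataFrame({
    "RemoteWork": ["Remote", None, None, "Hybrid"],
    "Country": ["Canada", "Germany", None, "India"],
    "ResponseId": [1, 2, 3, 4],
})

summary = pd.DataFrame({
    "missing": sample.isna().sum(),   # number of missing values per column
    "share": sample.isna().mean(),    # fraction of rows that are missing
})

# Keep only columns with gaps, worst first.
summary = summary[summary["missing"] > 0].sort_values("missing", ascending=False)
print(summary)
```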

# Exploratory Data Analysis

# Data Visualization