## Question III: Does a computer science background affect the working language?

It is well known that you do not need to study computer science to succeed in the software industry. Still, I wonder whether programming language preferences differ significantly between computer science graduates and others.

To answer this, we will look at the usage ratio of each language among computer science graduates and among everyone else.

In [2]:
import pandas as pd

### Gather data

In [17]:
#read from file
df_2018 = pd.read_csv("./data/survey_results_public_2018.csv", low_memory=False)
df_2019 = pd.read_csv("./data/survey_results_public_2019.csv")
df_2020 = pd.read_csv("./data/survey_results_public_2020.csv")
df_2021 = pd.read_csv("./data/survey_results_public_2021.csv")

df_merged1 = pd.concat([df_2018, df_2019], sort=False)
df_merged2 = pd.concat([df_2020, df_2021], sort=False)
df = pd.concat([df_merged1, df_merged2], sort=False)

In [45]:
#Create a combined column that gathers LanguageWorkedWith (2018, 2019 and 2020) and LanguageHaveWorkedWith (2021)
idx = len(df_2021)
remaining_index = len(df) - idx 
new_column = df["LanguageWorkedWith"][0:remaining_index].values.tolist()
new_column.extend(df["LanguageHaveWorkedWith"][remaining_index:].values.tolist())

In [47]:
df["WorkedLanguage"] = new_column
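The positional slicing above works only because the 2021 rows sit at the end of the concatenated frame. As an alternative sketch, assuming each respondent has a value in at most one of the two columns, the columns can be combined by label instead of by position. The tiny frames below are made-up stand-ins for the survey files, and `ignore_index=True` is an addition that the original cells do not use.

```python
import pandas as pd

# Made-up stand-ins for the survey files (not real survey rows).
old_years = pd.DataFrame({"LanguageWorkedWith": ["Java;C", None]})
year_2021 = pd.DataFrame({"LanguageHaveWorkedWith": ["Python", "Go;Rust"]})

# ignore_index=True gives the combined frame a unique 0..n-1 index.
demo = pd.concat([old_years, year_2021], ignore_index=True, sort=False)

# Take LanguageWorkedWith where present, otherwise fall back to the 2021 column.
demo["WorkedLanguage"] = demo["LanguageWorkedWith"].fillna(
    demo["LanguageHaveWorkedWith"]
)

print(demo["WorkedLanguage"].tolist())
```

Rows that are missing in both columns stay missing, so the later `dropna` step still removes them.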

### Clean

In [48]:
df.dropna(subset=["UndergradMajor", "WorkedLanguage"], inplace=True)

For the analysis, we need to drop all the rows where "UndergradMajor" or "WorkedLanguage" is missing (NaN).

In [49]:
df.reset_index(inplace=True)

In [50]:
df["UndergradMajor"].value_counts()

Computer science, computer engineering, or software engineering          68950
Another engineering discipline (ex. civil, electrical, mechanical)        9430
Information systems, information technology, or system administration     8366
A natural science (ex. biology, chemistry, physics)                       4557
Mathematics or statistics                                                 4173
Web development or web design                                             3742
A business discipline (ex. accounting, finance, marketing)                2653
A humanities discipline (ex. literature, history, philosophy)             2272
A social science (ex. anthropology, psychology, political science)        2006
Fine arts or performing arts (ex. graphic design, music, studio art)      1658
I never declared a major                                                  1109
A health science (ex. nursing, pharmacy, radiology)                        409
Name: UndergradMajor, dtype: int64

In [51]:
cs_df = df[df["UndergradMajor"] == "Computer science, computer engineering, or software engineering"]

In [52]:
len(df)

109325

In [53]:
cs_df.shape

(68950, 239)

In [54]:
other_df = df.drop(index=cs_df.index)

In [55]:
other_df.shape

(40375, 239)
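The split above uses `df.drop(index=cs_df.index)`, which is only safe once `reset_index` has made the index labels unique: with duplicated labels, `drop` removes every row that shares a label. A minimal sketch of an alternative that does not depend on the index uses a boolean mask and its negation. The three-row frame and the names `demo`, `cs_demo` and `other_demo` are made up for illustration.

```python
import pandas as pd

CS_MAJOR = "Computer science, computer engineering, or software engineering"

# Made-up rows with a deliberately duplicated index label, as pd.concat can produce.
demo = pd.DataFrame(
    {"UndergradMajor": [CS_MAJOR, "Mathematics or statistics", CS_MAJOR]},
    index=[0, 0, 1],
)

is_cs = demo["UndergradMajor"] == CS_MAJOR
cs_demo = demo[is_cs]       # rows whose major is computer science
other_demo = demo[~is_cs]   # every remaining row

# The two parts always add up to the whole frame.
print(len(cs_demo), len(other_demo), len(demo))
```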

### Analyse

In [3]:
def find_frequency(df: pd.DataFrame, col: str) -> dict:
    """
    Returns the usage ratio of each language in the given dataframe and column.
    input:
        df: dataframe to search
        col: column name to calculate frequency
    output:
        langs_unified: dictionary that maps each language name to its frequency
    """
    df_length = len(df)
    # count how often each distinct ";"-separated answer occurs
    langs = df[col].value_counts()

    # find occurrences of each individual language
    langs_unified = {}
    for answer, count in langs.items():
        for lang in answer.split(";"):
            langs_unified[lang] = langs_unified.get(lang, 0) + count

    # find frequency: divide each count by the number of rows
    for k, v in langs_unified.items():
        langs_unified[k] = v / df_length
    return langs_unified
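The same counting can also be sketched with pandas string methods instead of explicit loops. `find_frequency_vectorized` is a hypothetical name, not part of the original notebook, and the four-row frame is made up. Like the loop version, it divides the number of mentions by the total number of rows.

```python
import pandas as pd

def find_frequency_vectorized(df: pd.DataFrame, col: str) -> dict:
    """Hypothetical variant of find_frequency built on pandas string methods."""
    # One row per (respondent, language) pair.
    exploded = df[col].str.split(";").explode()
    # Mentions of each language divided by the number of respondents.
    return (exploded.value_counts() / len(df)).to_dict()

# Made-up answers in the same ";"-separated format as the survey column.
demo = pd.DataFrame({"WorkedLanguage": ["Java;C", "Java", "C;Python", "Java"]})
freq = find_frequency_vectorized(demo, "WorkedLanguage")
print(freq)
```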

In [71]:
cs_langs_unified = find_frequency(cs_df, "WorkedLanguage")

In [72]:
cs_langs_unified

{'Java': 0.497897026831037,
 'C#': 0.362291515591008,
 'JavaScript': 0.699303843364757,
 'SQL': 0.577577955039884,
 'HTML': 0.4145757795503988,
 'CSS': 0.39546047860768674,
 'TypeScript': 0.20845540246555475,
 'PHP': 0.27286439448876,
 'HTML/CSS': 0.24018854242204496,
 'Objective-C': 0.07186366932559826,
 'Swift': 0.08292965917331399,
 'Python': 0.38281363306744015,
 'Kotlin': 0.062175489485134156,
 'Bash/Shell': 0.24274111675126903,
 'Bash/Shell/PowerShell': 0.14529369108049311,
 'C++': 0.2625670775924583,
 'VB.NET': 0.04130529369108049,
 'C': 0.2326178390137781,
 'Scala': 0.04951414068165337,
 'Ruby': 0.09469180565627267,
 'Groovy': 0.03208121827411167,
 'Other(s):': 0.032777374909354604,
 'VBA': 0.039985496736765776,
 'Delphi/Object Pascal': 0.015068890500362581,
 'Go': 0.08008701957940537,
 'Visual Basic 6': 0.02384336475707034,
 'Assembly': 0.07232777374909355,
 'CoffeeScript': 0.019332849891225527,
 'R': 0.04436548223350254,
 'F#': 0.013633067440174038,
 'Dart': 0.007425670775924
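The frequencies above list `HTML`, `CSS` and `HTML/CSS` as separate entries, and likewise `Bash/Shell` and `Bash/Shell/PowerShell`, which suggests that the answer labels differ between survey years. A rough sketch of one way to merge them before counting is shown below. The mapping in `LABEL_MAP` is an assumption about which labels correspond to each other, `normalize_languages` is a hypothetical helper, and the sample answers are made up.

```python
import pandas as pd

# Assumed correspondence between older and newer labels (not confirmed by the survey docs).
LABEL_MAP = {
    "HTML": "HTML/CSS",
    "CSS": "HTML/CSS",
    "Bash/Shell": "Bash/Shell/PowerShell",
}

def normalize_languages(cell: str) -> list:
    """Split a ';'-separated answer, map old labels, and drop duplicates."""
    mapped = {LABEL_MAP.get(lang, lang) for lang in cell.split(";")}
    return sorted(mapped)

# Made-up answers mixing old and new labels.
answers = pd.Series(["HTML;CSS;Java", "HTML/CSS;Java", "Bash/Shell;C"])
normalized = answers.map(normalize_languages)

# Share of respondents mentioning each (normalized) language.
freq = normalized.explode().value_counts() / len(answers)
print(freq)
```

Using a set per respondent matters here: someone who ticked both `HTML` and `CSS` should count once for `HTML/CSS`, not twice.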

In [73]:
others_langs_unified = find_frequency(other_df, "WorkedLanguage")

In [74]:
others_langs_unified

{'JavaScript': 0.6806191950464396,
 'PHP': 0.28752941176470587,
 'SQL': 0.5580928792569659,
 'HTML': 0.4063405572755418,
 'CSS': 0.3851145510835913,
 'C#': 0.2846315789473684,
 'HTML/CSS': 0.2541919504643963,
 'Python': 0.4035170278637771,
 'Java': 0.3231950464396285,
 'Bash/Shell': 0.2468359133126935,
 'TypeScript': 0.1726811145510836,
 'R': 0.0919876160990712,
 'Bash/Shell/PowerShell': 0.1465015479876161,
 'Objective-C': 0.05060061919504644,
 'Swift': 0.05996284829721362,
 'C++': 0.18989473684210526,
 'VB.NET': 0.03918266253869969,
 'Kotlin': 0.03529411764705882,
 'C': 0.1698328173374613,
 'Ruby': 0.10204334365325077,
 'Other(s):': 0.04002476780185758,
 'VBA': 0.06912693498452012,
 'CoffeeScript': 0.021770897832817337,
 'Delphi/Object Pascal': 0.015455108359133126,
 'Groovy': 0.021671826625386997,
 'Matlab': 0.04421052631578947,
 'Scala': 0.035343653250773995,
 'Perl': 0.026278637770897832,
 'Assembly': 0.04861919504643963,
 'Go': 0.06855727554179566,
 'F#': 0.0116656346749226,
 'Vis

In [89]:
res_df = pd.DataFrame(data=list(cs_langs_unified.values()), index=list(cs_langs_unified.keys()))

In [90]:
res_df["others"] = res_df.index.map(others_langs_unified)

In [93]:
res_df.index.name = "Language"

In [94]:
res_df.columns=["CS", "others"]

In [96]:
res_df.head()

| Language | CS | others |
|---|---|---|
| Java | 0.497897 | 0.323195 |
| C# | 0.362292 | 0.284632 |
| JavaScript | 0.699304 | 0.680619 |
| SQL | 0.577578 | 0.558093 |
| HTML | 0.414576 | 0.406341 |


In [97]:
res_df["diff_major_vals"] = res_df["CS"] - res_df["others"]
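To see which languages differ most between the two groups, the difference column can be sorted. The sketch below rebuilds a three-row frame from frequencies reported earlier (rounded to six decimals) so that it runs on its own. The names `demo` and `ranked` are illustrative only.

```python
import pandas as pd

# A few rows copied (rounded) from the frequencies computed earlier.
demo = pd.DataFrame(
    {
        "CS": [0.497897, 0.272864, 0.262567],
        "others": [0.323195, 0.287529, 0.189895],
    },
    index=pd.Index(["Java", "PHP", "C++"], name="Language"),
)
demo["diff_major_vals"] = demo["CS"] - demo["others"]

# Largest positive gap (more common among CS graduates) first.
ranked = demo.sort_values("diff_major_vals", ascending=False)
print(ranked)
```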

### Visualise

In [101]:
res_df.style.bar(subset=['diff_major_vals'], align='mid', color=['#d65f5f', '#5fba7d'])

| Language | CS | others | diff_major_vals |
|---|---|---|---|
| Java | 0.497897 | 0.323195 | 0.174702 |
| C# | 0.362292 | 0.284632 | 0.0776599 |
| JavaScript | 0.699304 | 0.680619 | 0.0186846 |
| SQL | 0.577578 | 0.558093 | 0.0194851 |
| HTML | 0.414576 | 0.406341 | 0.00823522 |
| CSS | 0.39546 | 0.385115 | 0.0103459 |
| TypeScript | 0.208455 | 0.172681 | 0.0357743 |
| PHP | 0.272864 | 0.287529 | -0.014665 |
| HTML/CSS | 0.240189 | 0.254192 | -0.0140034 |
| Objective-C | 0.0718637 | 0.0506006 | 0.0212631 |


As we can see in the table, Java, C++, and C are used more by computer science graduates than by others. The gap is more than 5 percentage points for each of them (about 17 points for Java, 7 for C++, and 6 for C). This may be because computer science and engineering programs teach these languages as part of their curriculum. C and C++ are also close to the computer hardware (for example, you manage memory yourself using pointers), although Java is not low-level in that sense.
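Since the question asks about a significant difference, one way to put a number on it is a two-proportion z-test. This is a sketch and not part of the original analysis. It assumes the respondents are independent and treats the four pooled survey years as one sample. `two_proportion_z_test` is a hypothetical helper; the group sizes and proportions are the ones reported above.

```python
import math

def two_proportion_z_test(p1: float, n1: int, p2: float, n2: int):
    """Return (z, two-sided p-value) for H0: both groups share one proportion."""
    # Pooled proportion under the null hypothesis.
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    # Standard error of the difference between the two sample proportions.
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Two-sided tail probability of the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

n_cs, n_others = 68950, 40375  # group sizes from cs_df.shape and other_df.shape

z_java, p_java = two_proportion_z_test(0.497897, n_cs, 0.323195, n_others)
z_html, p_html = two_proportion_z_test(0.414576, n_cs, 0.406341, n_others)

print(f"Java: z = {z_java:.2f}, p = {p_java:.3g}")
print(f"HTML: z = {z_html:.2f}, p = {p_html:.3g}")
```

With samples this large, even a small gap can come out as statistically significant, so the size of the difference is worth reading alongside the p-value.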