# How do you make a difference from within the Berlin tech scene? 

- Life is too short to spend in front of a computer if that time does not bring significant value to other people's lives.
- Personally, I've meandered through various professional environments which, although fulfilling at times, have ultimately had very little positive impact on the world.
- On the hunt for my next tech job, I want to take a different approach: expand my search drastically to find out which company suits me best.
- I want to understand the job market in Berlin (the city where I live) so that I can say, with some degree of confidence, that I'm doing the best I can from within this city.

## A few guiding questions:
- What programming, data or just generally tech roles are out there?
- What different sectors exist out there (e.g. mobility, e-commerce, rental market etc.)?
- What are the average salaries for different roles?
- Which companies currently have a positive impact?
- Which companies pay the best salaries?
- Which companies offer the best working environments?
- What does it mean to be a senior developer? How does one get there?
- What skill set is most in demand and in which fields? 

## Data sources:
- Scraping various job-posting websites (e.g. Glassdoor, LinkedIn)
- https://berlinstartupjobs.com/
- https://www.gehalt.de/
- https://berlinvalley.com/
- https://www.tagesspiegel.de/themen/gruenderzeit/

## Data collection
### Glassdoor

In [3]:
import pandas as pd
import re

#Read in the latest job scrape
df = pd.read_json('output/glassdoor-jobs_2021-06-13_12:11:49.json')

# Remove the column name prefixes (artefacts from the scraping process)
def strip_pref(x):
    # Try the longer prefix first; leave names without a known prefix untouched
    for prefix in ('data_jobView_header_', 'data_jobView_'):
        if x.startswith(prefix):
            return x[len(prefix):]
    return x
df.columns = [strip_pref(p) for p in df.columns]

# Keep only the most basic columns for now; the focus is on job_description, which will feed the NLP step
# The other scraped values are of little use here
column_names = [
    'ageInDays',
    'employer_name',
    'employer_size',
    'expired',
    'locationName',
    'normalizedJobTitle',
    'job_description',
    'job_listingId'
]
df = df[column_names]

#And normalise the column names once more into snake_case
pattern = re.compile(r'(?<!^)(?=[A-Z])')
df.columns = [pattern.sub('_', name).lower() for name in df.columns]

# Use job_listing_id as the index (TODO: watch out for ID clashes if more sources are added)
df = df.set_index('job_listing_id')
df

job_listing_id,age_in_days,employer_name,employer_size,expired,location_name,normalized_job_title,job_description
4086941415,2,"Amazon.com, Inc.",Mehr als 10.000 Mitarbeiter,False,Berlin,Lagerarbeiter (M/W/D),"<p>Amazon sucht motivierte Produktionshelfer, ..."
4086170144,3,"HubSpot, Inc.",1.001 bis 5.000 Mitarbeiter,False,Berlin,Account Executive,Who we are<br/><br/>HubSpot is the world's lea...
4107740389,4,Takeaway.com,1.001 bis 5.000 Mitarbeiter,False,Berlin,Junior Inside Sales Representative (M/W/D),<div><div><p>T&auml;glich bestellen Hunderttau...
4107558203,4,Sirius Facilities GmbH,201 bis 500 Mitarbeiter,False,Berlin,Junior Business Analyst (M/W/D),"<div><p><b>&Uuml;BER UNS</b></p><p>Wir, die Si..."
4085768549,4,WOW Tech International GmbH,201 bis 500 Mitarbeiter,False,Berlin,E-commerce Manager,<div><div><div><b>WHO WE ARE?</b></div><div></...
...,...,...,...,...,...,...,...
4107482231,4,IU Internationale Hochschule GmbH,201 bis 500 Mitarbeiter,False,Berlin,Cloud Engineer,<div><p>Grow with us - Start your career at IU...
4094031472,22,ManpowerGroup Inc.,Mehr als 10.000 Mitarbeiter,False,Berlin,Recruiter (M/W/D),<div>Recruiter - Helfergesch&auml;ft f&uuml;r ...
4007869663,1,Engineering People GmbH,201 bis 500 Mitarbeiter,False,Berlin,Projektmanager (M/W/D),<div>Aufgaben:<ul><li>F&uuml;hren eines Entwic...
4101202365,12,Cobalt Company,1 bis 50 Mitarbeiter,False,Berlin,Recruiter (M/W/D),<div><p>&Uuml;ber Cobalt</p><p>Cobalt ist die ...
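
The `job_description` values above still contain HTML tags and entities (`<p>`, `<br/>`, `&Uuml;`), which would need to be stripped before any NLP step. One possible approach, sketched here with only the standard library and pandas, is shown below. The helper name `clean_description` and the sample strings are assumptions made for illustration (the samples are shortened stand-ins, not real scraped rows), and a regex-based tag stripper is a rough heuristic rather than a full HTML parser.

```python
import html
import re

import pandas as pd

# Matches anything that looks like an HTML tag, e.g. <p>, </div>, <br/>
TAG_RE = re.compile(r'<[^>]+>')


def clean_description(raw: str) -> str:
    """Turn an HTML job description into plain text (hypothetical helper)."""
    text = TAG_RE.sub(' ', raw)               # replace tags with spaces
    text = html.unescape(text)                # decode entities such as &Uuml;
    return re.sub(r'\s+', ' ', text).strip()  # collapse runs of whitespace


# Illustrative stand-ins for scraped descriptions, not actual data
sample = pd.Series([
    '<div><p><b>&Uuml;BER UNS</b></p><p>Wir suchen dich</p></div>',
    'Who we are<br/><br/>HubSpot',
])
cleaned = sample.map(clean_description)
print(cleaned.tolist())
```

Applied to the real frame, this would be something like `df['job_description'].map(clean_description)`, assuming no description is missing.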


## How many junior positions are there?

In [9]:
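# This mask matches descriptions mentioning 'Amazon' (not junior roles yet)
# na=False treats missing descriptions as non-matches
contains_amazon = df['job_description'].str.contains(r'Amazon', na=False)
df[contains_amazon]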

job_listing_id,age_in_days,employer_name,employer_size,expired,location_name,normalized_job_title,job_description
4086941415,2,"Amazon.com, Inc.",Mehr als 10.000 Mitarbeiter,False,Berlin,Lagerarbeiter (M/W/D),"<p>Amazon sucht motivierte Produktionshelfer, ..."
4086941415,2,"Amazon.com, Inc.",Mehr als 10.000 Mitarbeiter,False,Berlin,Lagerarbeiter (M/W/D),"<p>Amazon sucht motivierte Produktionshelfer, ..."
4086941415,2,"Amazon.com, Inc.",Mehr als 10.000 Mitarbeiter,False,Berlin,Lagerarbeiter (M/W/D),"<p>Amazon sucht motivierte Produktionshelfer, ..."
4086941415,2,"Amazon.com, Inc.",Mehr als 10.000 Mitarbeiter,False,Berlin,Lagerarbeiter (M/W/D),"<p>Amazon sucht motivierte Produktionshelfer, ..."
4086941415,2,"Amazon.com, Inc.",Mehr als 10.000 Mitarbeiter,False,Berlin,Lagerarbeiter (M/W/D),"<p>Amazon sucht motivierte Produktionshelfer, ..."
4086941415,2,"Amazon.com, Inc.",Mehr als 10.000 Mitarbeiter,False,Berlin,Lagerarbeiter (M/W/D),"<p>Amazon sucht motivierte Produktionshelfer, ..."
4086941415,2,"Amazon.com, Inc.",Mehr als 10.000 Mitarbeiter,False,Berlin,Lagerarbeiter (M/W/D),"<p>Amazon sucht motivierte Produktionshelfer, ..."
4086941415,2,"Amazon.com, Inc.",Mehr als 10.000 Mitarbeiter,False,Berlin,Lagerarbeiter (M/W/D),"<p>Amazon sucht motivierte Produktionshelfer, ..."
4086941415,2,"Amazon.com, Inc.",Mehr als 10.000 Mitarbeiter,False,Berlin,Lagerarbeiter (M/W/D),"<p>Amazon sucht motivierte Produktionshelfer, ..."
4086941415,2,"Amazon.com, Inc.",Mehr als 10.000 Mitarbeiter,False,Berlin,Lagerarbeiter (M/W/D),"<p>Amazon sucht motivierte Produktionshelfer, ..."
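
The output above repeats the same `job_listing_id` ten times, which suggests the scrape contains duplicate listings that would inflate any count. A minimal sketch of dropping duplicates by index and then counting junior postings might look like the following. The tiny `jobs` frame is a made-up stand-in for `df`, and matching the whole word "junior" in the title or description is an assumption about how junior roles are labelled (German postings may use other terms).

```python
import pandas as pd

# Hypothetical miniature of the scraped frame, with one duplicated listing
jobs = pd.DataFrame(
    {
        'normalized_job_title': [
            'Lagerarbeiter (M/W/D)',
            'Lagerarbeiter (M/W/D)',
            'Junior Business Analyst (M/W/D)',
            'Cloud Engineer',
        ],
        'job_description': [
            '<p>Amazon sucht motivierte Produktionshelfer</p>',
            '<p>Amazon sucht motivierte Produktionshelfer</p>',
            '<p>Wir suchen einen Junior Analyst</p>',
            None,  # missing description
        ],
    },
    index=pd.Index(
        [4086941415, 4086941415, 4107558203, 4107482231],
        name='job_listing_id',
    ),
)

# Keep only the first row for each listing id
unique_jobs = jobs[~jobs.index.duplicated(keep='first')]

# Case-insensitive whole-word match; na=False treats missing text as no match
pattern = r'\bjunior\b'
is_junior = (
    unique_jobs['normalized_job_title'].str.contains(pattern, case=False, na=False)
    | unique_jobs['job_description'].str.contains(pattern, case=False, na=False)
)
n_junior = int(is_junior.sum())
print(f'{n_junior} of {len(unique_jobs)} unique listings mention "junior"')
```

Deduplicating on the index only works while `job_listing_id` is unique per listing; with several sources, a combined key of source and id would likely be safer.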
