
thusspokedata/Week-1-Project

 
 


The Project

This is the first-week group project for the Ironhack data class in Berlin, March 2021. The tasks can be found at the end of this README file. We created a function that pulls data from a database of data job postings in order to create a visualization showing the percentage of job adverts in the database that contain each keyword.

The Team:

samcana (https://github.com/samcana)
antonio-datahack (https://github.com/antonio-datahack)
Jennipher K (https://github.com/Jennipher0716)
And myself.

Process notes:

  • As a starting point, we explored the data and checked keywords in full-data-test-mood-check-pre-analysis.ipynb.
  • We started the project with a code-along, all together, in week-1-project-code.ipynb.
  • This file creates the graph skills-data-scientist-usa.png, which we used to create a Star Wars themed deliverable, as this was the theme of the presentations.
  • You can find the deliverable here: https://github.com/Alex-Skp/Week-1-Project/blob/master/onepager-delivery.pdf

Cleaning steps and execution of the code:

  • Cleaned the dataset to make it easier to find certain keywords in the job descriptions.
  • Removed meaningless words from the descriptions and stored the remaining words in lists.
  • Combined all the lists in order to look for keywords we found meaningful.
  • Decided to focus only on data scientist postings, as they were significantly more numerous than other postings.
  • Counted how many postings include the skills we have or might acquire in the bootcamp.
  • Calculated each count as a percentage of the total data scientist postings.
  • Plotted the result. We didn't spend enough time on the visualization, but we will make it look better for the presentation.
  • See the notebook: https://github.com/Alex-Skp/Week-1-Project/blob/master/group-project-code.ipynb
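
The steps above can be sketched roughly as follows. This is a minimal plain-Python illustration, not the code from the notebook: the sample postings, the field names `position` and `description`, the `STOPWORDS` set, the `KEYWORDS` list and both helper functions are assumptions made up for the example, and the sketch stores tokens in sets rather than lists so that each posting counts a keyword at most once.

```python
import re

# Hypothetical sample data; the real project reads postings from the Kaggle data set.
postings = [
    {"position": "Data Scientist", "description": "Experience with Python, SQL and machine learning."},
    {"position": "Data Scientist", "description": "We use R and Python for statistics."},
    {"position": "Data Engineer", "description": "Build pipelines with Spark and SQL."},
    {"position": "Senior Data Scientist", "description": "Tableau dashboards and SQL reporting."},
]

# Assumed examples of "meaningless" words and of skills of interest.
STOPWORDS = {"and", "with", "for", "we", "use", "the", "a", "of"}
KEYWORDS = ["python", "sql", "tableau", "r"]


def tokenize(description):
    """Lower-case a description, split it into word tokens and drop stopwords."""
    tokens = re.findall(r"[a-z0-9+#]+", description.lower())
    return {token for token in tokens if token not in STOPWORDS}


def keyword_percentages(postings, keywords, title="data scientist"):
    """Return {keyword: percentage of matching postings that mention it}."""
    # Keep only postings whose position contains the chosen job title.
    selected = [p for p in postings if title in p["position"].lower()]
    if not selected:
        return {}
    token_sets = [tokenize(p["description"]) for p in selected]
    return {
        kw: 100 * sum(kw in tokens for tokens in token_sets) / len(selected)
        for kw in keywords
    }


percentages = keyword_percentages(postings, KEYWORDS)
for kw, pct in sorted(percentages.items(), key=lambda item: item[1], reverse=True):
    print(f"{kw:10s} {pct:5.1f}%")
```

Matching on whole tokens rather than substrings keeps a short keyword such as "r" from being counted inside unrelated words.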

A final post-project function:


Task : clean the data - summarise your findings in a 'one pager'

Here's your challenge for your first group project!

The deadline for finishing is Monday at noon. I will give you class time to work on this project, and you should submit your one pager via the student portal AND deliver a short group presentation to your classmates.

You will be working with a data set hosted on Kaggle that has been scraped for you from the web about US data science hires in 2018 (i.e. pre-COVID!). The author wanted to look at some specific questions:

Who gets hired? What kind of talent do employers want when they are hiring a data scientist?

Which location has the most opportunities?

What skills, tools, degrees or majors do employers want the most for data scientists?

I think you can do more with this data set to summarise the insights and the process of data wrangling. The data is not easy to work with at the moment. Your main challenge will be to use Python to clean, wrangle and generally reshape the data to make it more straightforward to analyse. To visualise what you find in the data, you can either export it to a CSV and use Excel to chart it, or you can explore the capabilities of Python to plot the data.

You will be in a group (2-3 students) to work on this project; as we are remote, this is an opportunity to get to know each other while applying your recently acquired skills working with messy data. This is your first group project: be reasonable in your expectations of what can be achieved in the timeframe and working with new people!

The insights you find can be documented simply with screenshots of your data frames or downloaded images of charts, but I would like to see these accompanied by some simple annotation/text summarising both what you found AND how easy it was to get to. What we want from each group is a one pager, suitable for an infographic or blog page, describing what you learnt from the data and what the gaps in the data or limitations of it are.

For inspiration on what sort of insights you might look into, you can see the web scraper's blog here : https://nycdatascience.com/blog/student-works/who-gets-hired-an-outlook-of-the-u-s-data-scientist-job-market-in-2018/

Some ideas for working successfully remotely with a group:

  • set up a co-working zoom / slack session

  • have an 'installation party' - getting started with the data all together, bring your own drinks and snacks

  • some of the group could try working primarily with python/pandas, others can try with Excel - and compare what you find

  • split the task among you: maybe some of you are better at presentations, others at pandas or plotting

  • share a digital whiteboard to brainstorm ideas

  • agree a shared communication method, e.g. Telegram / Slack, or co-work in a Zoom breakout room

Here's the data we will be working with:

Kaggle data source

HINT : You will need to first download the data as csv file(s)

Expected steps and outcome:

  • You can use the ALL data set you see in the Kaggle link or practice combining the separate files into one data frame

  • employ string functions or regex (e.g. LIKE, IF/ELSE) to extract common values from strings of different lengths, e.g. the job description

  • insights by any combination of job profile, company, location city, area of the country

  • create new columns as needed to enhance the data source: for example employ Boolean T/F logic to indicate which roles are closest to big financial or software centres in the US

  • make a decision about handling NULLs in the data - fill in values where logical, ignore them or clean them where not

  • any other data cleaning or wrangling tasks you find useful.

  • 'one pager' summary - including insights, commentary, review of how easy the data was to work with and highlighting any limitations you found in the data set. This can be a PDF, slide, Word doc etc. and can be as beautiful or as simple as you like. You will be sharing this with your classmates and the teaching team will provide feedback on your submissions. As you effectively have ONLY one page to make your case, you might start by identifying multiple trends and then scale back to focus on just one or two important ones. The main focus of the exercise is on working with messy data, so if you don't find any great data insights, you should feel free to take screenshots of your cleaning procedures and talk about them. One member of the group should host this one pager on Git / Google Drive / similar and submit the URL.

  • a short class presentation (aim for 5 minutes) involving all members of your group to talk through your method and findings.

--- any questions reach out to the LT or TAs
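
As a rough illustration of the expected steps above (combining files, extracting values with regex, adding a Boolean column and handling NULLs), here is a minimal sketch that uses only the Python standard library so it runs without extra installs. Everything specific in it is hypothetical and not taken from the Kaggle data set: the inline CSV contents, the column names (`position`, `company`, `location`, `description`), the `TECH_HUBS` set, the degree pattern and the helper names. In pandas, the same steps would typically map onto `pd.concat`, `Series.str.extract` and `DataFrame.fillna`.

```python
import csv
import io
import re

# Hypothetical stand-ins for two downloaded CSV files; real code would open files on disk.
FILE_A = """position,company,location,description
Data Scientist,Acme,"San Francisco, CA",Requires a PhD in statistics and Python.
Data Analyst,,"Austin, TX",Bachelor's degree and SQL skills.
"""
FILE_B = """position,company,location,description
Data Scientist,Globex,"New York, NY",Master's degree preferred; R or Python.
Data Engineer,Initech,,
"""

# Assumed set of cities treated as big financial or software centres.
TECH_HUBS = {"San Francisco", "New York", "Seattle"}

# Assumed pattern for pulling a degree level out of free-text descriptions.
DEGREE_PATTERN = re.compile(r"\b(phd|master|bachelor)", re.IGNORECASE)


def load_rows(*csv_texts):
    """Combine several CSV sources that share a header into one list of dicts."""
    rows = []
    for text in csv_texts:
        rows.extend(csv.DictReader(io.StringIO(text)))
    return rows


def clean_row(row):
    """Fill or mark missing values and add derived columns to one row."""
    cleaned = {key: (value or "").strip() for key, value in row.items()}
    # NULL handling: fill in a placeholder where that is logical ...
    cleaned["company"] = cleaned["company"] or "Unknown"
    # ... and keep an explicit None where no sensible default exists.
    city = cleaned["location"].split(",")[0].strip()
    cleaned["city"] = city or None
    # Boolean column: is the role located in one of the assumed hubs?
    cleaned["near_hub"] = city in TECH_HUBS
    # Regex extraction from strings of different lengths.
    match = DEGREE_PATTERN.search(cleaned["description"])
    cleaned["degree"] = match.group(1).lower() if match else None
    return cleaned


rows = [clean_row(row) for row in load_rows(FILE_A, FILE_B)]
for row in rows:
    print(row["position"], row["company"], row["city"], row["near_hub"], row["degree"], sep=" | ")
```

Whether a missing value is filled, left as None or dropped depends on the column; the sketch shows one choice of each kind for illustration only.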
