Intro_to_Data_Science_Python

This is a repository of projects that will teach you the most important libraries and concepts used in data science with the help of Python. The programs below demonstrate how easy and efficient Python is, and why it is one of the most popular and widely used languages in data science. Have fun and enjoy the coding! Based on the course https://www.udemy.com/course/100-days-of-code/ .

01_Analysing_Salaries_of_Graduates_by_Major

Basic operations using the Pandas library for data exploration. As an example, the data from salaries_by_college_major.csv was explored with the Python library Pandas. For easier understanding and manipulation of the data, the Jupyter Notebook, an interactive computing platform, was used. In this example, the salaries of university graduates by major are analysed.
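
The kind of exploration described above can be sketched as follows. The column names and values here are a hypothetical subset of salaries_by_college_major.csv, not the real file:

```python
import pandas as pd

# Hypothetical subset of salaries_by_college_major.csv
df = pd.DataFrame({
    "Undergraduate Major": ["Accounting", "Economics", "Physics"],
    "Starting Median Salary": [46000.0, 50100.0, 50300.0],
    "Mid-Career Median Salary": [77100.0, 98600.0, 97300.0],
})

# Basic exploration: dimensions and missing values
print(df.shape)
print(df.isna().sum())

# Which major has the highest starting salary?
top = df.loc[df["Starting Median Salary"].idxmax(), "Undergraduate Major"]
print(top)
```

The same pattern (`idxmax` on a column, then `loc` on the result) answers most "which row is the extreme?" questions in the notebook.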

02_Analyse_Popularity_of_Programming_Languages

Each post on Stack Overflow comes with a tag, and this tag can be the name of a programming language. Based on that, we will gather data from Stack Overflow and generate a CSV file containing, for various programming languages, the number of times each language is tagged in a post. This will help us determine which programming language is the most popular. The analysis is carried out with the Python library Pandas, and the results are plotted with the Python library Matplotlib.
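
A minimal sketch of that workflow, using a hypothetical slice of the tag data (the DATE/TAG/POSTS column names are assumptions for illustration):

```python
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so this also runs headless
import matplotlib.pyplot as plt

# Hypothetical slice of the Stack Overflow tag counts
df = pd.DataFrame({
    "DATE": ["2020-01-01", "2020-01-01", "2020-02-01", "2020-02-01"],
    "TAG": ["python", "java", "python", "java"],
    "POSTS": [25000, 21000, 26000, 20500],
})
df["DATE"] = pd.to_datetime(df["DATE"])

# Pivot so each language becomes its own column, then plot posts over time
pivot = df.pivot(index="DATE", columns="TAG", values="POSTS")
pivot.plot(figsize=(10, 5))
plt.ylabel("Number of posts")

# Rank the languages by total tagged posts in this sample
totals = pivot.sum().sort_values(ascending=False)
print(totals.index[0])
```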

03_Analysing_Dataset_LEGO_Pieces

The data for LEGO pieces (colors.csv, sets.csv, and themes.csv) comes from https://rebrickable.com/downloads/ . The analysis is carried out with the Python library Pandas, and the results are plotted with the Python library Matplotlib. The analysis also demonstrates how to work with a relational database structure and how to merge columns across tables. Thanks to this analysis, some very interesting facts could be found, such as: What is the most enormous LEGO set ever, and how many pieces does it have? When were the first LEGO sets released, and how many sets did the company sell when it first opened its doors? Which LEGO theme is the most popular? By analysing the data, we can see when the company really took off based on its product offering. We can also answer questions like whether LEGO complexity has changed over time, or which sets tend to have more parts.
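
The relational merge at the heart of this project can be sketched like so. The two rows below are hypothetical fragments standing in for the real Rebrickable tables, which do use `set_num`, `theme_id`, `num_parts` and an `id` key on themes:

```python
import pandas as pd

# Hypothetical fragments of sets.csv and themes.csv
sets = pd.DataFrame({
    "set_num": ["10276-1", "60262-1"],
    "name": ["Colosseum", "Passenger Airplane"],
    "year": [2020, 2020],
    "theme_id": [673, 52],
    "num_parts": [9036, 669],
})
themes = pd.DataFrame({
    "id": [673, 52],
    "name": ["Creator Expert", "City"],
})

# Relational join: match sets.theme_id against themes.id; both tables
# have a "name" column, so suffixes keep them apart
merged = pd.merge(sets, themes, left_on="theme_id", right_on="id",
                  suffixes=("_set", "_theme"))

# Which set in this sample has the most pieces?
biggest = merged.loc[merged["num_parts"].idxmax()]
print(biggest["name_set"], biggest["num_parts"])
```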

04_Combine_Google_Trends_with_other_Data

This analysis shows how to combine Google Trends web-search data from https://trends.google.com/trends with other data sources. The popularity of search terms can tell us a lot about future trends. In this particular example, three main data sets were examined.

  1. Bitcoin search volume in comparison to Bitcoin prices
  2. The relationship between Tesla's stock price and Tesla search volume
  3. Unemployment Rate vs. Unemployment Benefits Search Volume

For the Bitcoin and Tesla stock prices, https://finance.yahoo.com/quote was used.
For the unemployment rate, https://fred.stlouisfed.org/series/UNRATE/ was used.
In order to match the data sets, resampling of the dates and time series was necessary.
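
The resampling step can be sketched as follows: the price data is daily while Google Trends data is monthly, so the daily series is averaged into month-start buckets to align the two. The prices below are made-up illustration values:

```python
import pandas as pd

# Hypothetical daily Bitcoin closing prices; Google Trends data is monthly,
# so we resample the daily series to month-start averages to align them
daily = pd.DataFrame(
    {"CLOSE": [100.0, 110.0, 120.0, 200.0, 220.0]},
    index=pd.to_datetime(["2020-01-10", "2020-01-20", "2020-01-30",
                          "2020-02-05", "2020-02-15"]),
)

monthly = daily.resample("MS").mean()  # "MS" = calendar month start
print(monthly)
```

After this step, both series share the same monthly index and can be plotted on a shared time axis.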

05_Analysing_Google_Apps_Plotly

This analysis focuses on data about Google apps scraped from the Google Play Store by Lavanya Gupta in 2018. The original files are available [here](https://www.kaggle.com/lava18/google-play-store-apps). The main goal was to depict interesting facts about Google apps with the help of the Python library Plotly, using diverse chart types, such as pie, bar, box, and many more.
The data about Google apps was explored to find out interesting facts like:

  • how competitive various app categories (for example, games, lifestyle, and weather) are,
  • what the most popular apps are,
  • what the most downloaded app's estimated revenue was,
  • how the monetization of an app affects its download count,
  • what a reasonable price for an app is.

06_NumPy_and_N_Dimensional_Arrays

In this project, we will explore NumPy (Numerical Python), a Python library used in almost every field of science and engineering. It is practically the standard for working with numerical data in Python. This project is an introduction that gives a better understanding of how to work with the library.
The main points include:

  • how to work with arrays
  • how to create ndarrays
  • creating arrays with standard functions such as arange(), random(), or linspace()
  • what exactly broadcasting is and how it works
  • how to do linear algebra with NumPy
  • image manipulation with NumPy arrays
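
The first few bullets can be condensed into a short sketch: array construction with arange() and linspace(), broadcasting a 1-D row across a 2-D array, and a small piece of linear algebra:

```python
import numpy as np

# Creating arrays with standard constructors
a = np.arange(6).reshape(2, 3)    # [[0, 1, 2], [3, 4, 5]]
b = np.linspace(0.0, 1.0, num=3)  # [0.0, 0.5, 1.0]

# Broadcasting: the 1-D row b is stretched across both rows of a
c = a + b
print(c)

# A little linear algebra: matrix product of a with its transpose
m = a @ a.T                       # shape (2, 2)
print(m)
```

Broadcasting works here because the trailing dimensions match (3 and 3); NumPy virtually repeats `b` along the missing axis instead of copying it.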

07_Seaborn_and_Linear_Regression

This project focuses on analysing data about films. The data was scraped on May 1, 2018 from https://www.the-numbers.com/movie/budgets . In this analysis, we will explore how the budget of a movie influences its revenue, and how to predict future revenue based on the budget and the year the movie was filmed. Different Python libraries were used: the visualisation library Seaborn, which is based on Matplotlib, for generating different kinds of charts (bubble charts, scatter charts, regressions), as well as the open-source data analysis library scikit-learn (the gold standard for machine learning), which was used to fit the linear regression, check how accurate our model is, and make predictions about future revenue.
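
The scikit-learn part can be sketched like this. The budget/revenue pairs are invented illustration values, not the scraped data:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical budgets (in $ millions) and revenues for a handful of films
X = np.array([[10.0], [50.0], [100.0], [150.0], [200.0]])
y = np.array([30.0, 120.0, 260.0, 380.0, 520.0])

# Fit a simple linear regression: revenue ≈ slope * budget + intercept
model = LinearRegression().fit(X, y)
print(model.coef_[0], model.intercept_)

# R^2 tells us how well the line explains the data (1.0 = perfect fit)
print(model.score(X, y))

# Predict revenue for a hypothetical $120M budget
pred = model.predict([[120.0]])
print(pred[0])
```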

08_Nobel_Prize_Analysis

In this project, we're going to analyse data (https://www.nobelprize.org/prizes/lists/all-nobel-prizes/) on the past winners of the Nobel Prize. Thanks to this analysis, we will find some interesting facts about the Nobel laureates, such as: the ratio of male to female winners, who was the first to win a Nobel Prize, how many people have won a Nobel Prize more than once, how many categories there are and how many prizes have been awarded in each category, the number of Nobel Prizes awarded over time, which countries have the most Nobel Prizes, which cities produce the most discoveries, where the Nobel laureates were born, and what patterns there are in the laureates' age at the time of the award.
Different Python libraries were used to better visualise the results: Seaborn, which is based on Matplotlib, for generating different kinds of charts (bubble charts, scatter charts, box charts, regressions); Plotly, a Python graphing library that makes interactive, publication-quality graphs (like sunbursts and choropleths); and Matplotlib, a comprehensive library for creating static, animated, and interactive visualisations in Python. Matplotlib makes easy things easy and hard things possible.
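
Questions like "how many people won more than once?" reduce to a `value_counts` over the laureate names. The rows below are a tiny hypothetical stand-in for the real dataset (the real winners listed are used only as familiar examples):

```python
import pandas as pd

# Hypothetical laureate records in the shape of the Nobel dataset
df = pd.DataFrame({
    "full_name": ["Marie Curie", "Marie Curie", "Albert Einstein",
                  "Linus Pauling", "Linus Pauling"],
    "category": ["Physics", "Chemistry", "Physics", "Chemistry", "Peace"],
    "year": [1903, 1911, 1921, 1954, 1962],
})

# Prizes per category
print(df["category"].value_counts())

# Laureates who won more than once
counts = df["full_name"].value_counts()
repeat_winners = counts[counts > 1].index.tolist()
print(repeat_winners)
```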

09_Hospital_Birth_Deaths_Analysis

In this project, we're going to analyse data (http://graphics8.nytimes.com/images/blogs/freakonomics/pdf/the%20etiology,%20concept%20and%20prophylaxis%20of%20childbed%20fever.pdf) that was collected by Dr. Semmelweis in the 1800s on the deaths of women in maternity wards from childbed fever. Some highlights from the analysis include calculating the percentage of women dying in childbirth and visualising the total number of births and deaths over time. We will also look more closely at the effect of handwashing, calculating the difference in the average monthly death rate before and after handwashing was introduced, and using histograms to visualise the monthly distribution of outcomes (percentage of deaths).
The main libraries used in this analysis were:
Seaborn, which is based on Matplotlib, for generating different kinds of charts.
Plotly, a Python graphing library that makes interactive, publication-quality graphs.
Matplotlib, a comprehensive library for creating static, animated, and interactive visualisations in Python.
SciPy, open-source software for mathematics, science, and engineering (used here to calculate the t-statistic and the p-value).
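
The SciPy step can be sketched as a two-sample t-test on the monthly death rates before and after handwashing. The rates below are made-up values in the spirit of the dataset:

```python
import numpy as np
from scipy import stats

# Hypothetical monthly death rates (proportion of mothers dying)
# before and after mandatory handwashing was introduced
before = np.array([0.12, 0.10, 0.15, 0.18, 0.11, 0.14])
after = np.array([0.02, 0.03, 0.01, 0.02, 0.04, 0.02])

# Two-sample t-test: is the drop in the mean death rate significant?
t_stat, p_value = stats.ttest_ind(before, after)
print(t_stat, p_value)

# Difference in the average monthly death rate
print(before.mean() - after.mean())
```

A small p-value (conventionally below 0.05) means the observed drop is unlikely to be due to chance alone.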

10_Predict_House_Prices

In this project, we will use data from the UCI ML housing dataset, https://archive.ics.uci.edu/ml/machine-learning-databases/housing/. This data set includes 14 characteristics describing the housing market in the Boston area in the 1970s. We will analyse the data and, based on that, build a multivariable regression model to predict house prices in this area. We will divide the data set into two parts: one used to train our multivariable linear regression, i.e. to find its parameters, and the other used for testing. The price will be the target value, and the remaining characteristics (13 in total, like CRIM (crime rate), RM (number of rooms), and NOX (pollution)) will be used as features to determine the price. The features will be analysed and checked to see whether they are sufficient for predicting house prices. The main libraries used in this analysis were:
Seaborn, which is based on Matplotlib, for generating different kinds of charts.
Plotly, a Python graphing library that makes interactive, publication-quality graphs.
Matplotlib, a comprehensive library for creating static, animated, and interactive visualisations in Python.
SciPy, open-source software for mathematics, science, and engineering (used here to calculate the t-statistic and the p-value).
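
The train/test split and multivariable fit can be sketched as follows. Instead of the real Boston data, this uses a synthetic stand-in with two features (a rooms-like and a crime-like variable) and a known price formula, so we can check that the regression recovers the coefficients:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the housing data: price depends linearly on
# the number of rooms (positive) and the crime rate (negative), plus noise
rng = np.random.default_rng(42)
rooms = rng.uniform(4, 9, size=100)
crime = rng.uniform(0, 10, size=100)
price = 8.0 * rooms - 1.5 * crime + rng.normal(0, 1, size=100)

# Hold out 20% of the rows for testing, train on the rest
X = np.column_stack([rooms, crime])
X_train, X_test, y_train, y_test = train_test_split(
    X, price, test_size=0.2, random_state=10)

model = LinearRegression().fit(X_train, y_train)
print(model.coef_)                  # should be close to [8.0, -1.5]
print(model.score(X_test, y_test))  # R^2 on the held-out test data
```

Evaluating on the held-out rows, rather than the training rows, is what tells us whether the model generalises instead of merely memorising.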
