Intro_to_Data_Science_Python

This is a repository of projects that will teach you the most important libraries and concepts used in data science with the help of Python. The programs below demonstrate how easy and efficient Python is, and why it is one of the most popular and widely used languages in data science. Have fun and enjoy the coding! Based on the course https://www.udemy.com/course/100-days-of-code/ .

01_Analysing_Salaries_of_Graduates_by_Major

Basic operations using the Pandas library for data exploration. As an example, the data from salaries_by_college_major.csv was explored with the Python library Pandas. For easier understanding and manipulation of the data, the Jupyter Notebook, an interactive computing platform, was used. In this example, the salaries of university graduates by major are analysed.
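
The kind of exploration described above can be sketched as follows. The column names and values here are a hypothetical subset of salaries_by_college_major.csv, not the real file:

```python
import pandas as pd

# Hypothetical subset of salaries_by_college_major.csv
df = pd.DataFrame({
    "Undergraduate Major": ["Accounting", "Economics", "Physics"],
    "Starting Median Salary": [46000.0, 50100.0, 50300.0],
    "Mid-Career Median Salary": [77100.0, 98600.0, 97300.0],
})

# Basic exploration: dimensions and missing values
print(df.shape)
print(df.isna().sum())

# Which major has the highest starting salary?
top = df.loc[df["Starting Median Salary"].idxmax(), "Undergraduate Major"]
print(top)
```

The same pattern (`idxmax` on a column, then `loc` on the result) answers most "which row is the extreme?" questions in the notebook.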

02_Analyse_Popularity_of_Programming_Languages

Each post on Stack Overflow comes with a tag, and this tag can be the name of a programming language. Based on that, we will gather data from Stack Overflow and generate a CSV file containing, for various programming languages, the number of times each language is tagged in a post. This will help us determine which programming language is the most popular. The analysis is carried out with the Python library Pandas, and the results are plotted with the Python library Matplotlib.
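
A minimal sketch of that workflow, using a hypothetical slice of the tag data (the DATE/TAG/POSTS column names are assumptions for illustration):

```python
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so this also runs headless
import matplotlib.pyplot as plt

# Hypothetical slice of the Stack Overflow tag counts
df = pd.DataFrame({
    "DATE": ["2020-01-01", "2020-01-01", "2020-02-01", "2020-02-01"],
    "TAG": ["python", "java", "python", "java"],
    "POSTS": [25000, 21000, 26000, 20500],
})
df["DATE"] = pd.to_datetime(df["DATE"])

# Pivot so each language becomes its own column, then plot posts over time
pivot = df.pivot(index="DATE", columns="TAG", values="POSTS")
pivot.plot(figsize=(10, 5))
plt.ylabel("Number of posts")

# Rank the languages by total tagged posts in this sample
totals = pivot.sum().sort_values(ascending=False)
print(totals.index[0])
```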

03_Analysing_Dataset_LEGO_Pieces

The data for LEGO pieces (colors.csv, sets.csv, and themes.csv) comes from https://rebrickable.com/downloads/ . The analysis is carried out with the Python library Pandas, and the results are plotted with the Python library Matplotlib. The analysis also demonstrates how to work with a relational database structure and how to merge columns across tables. Thanks to this analysis, some very interesting facts could be found, such as: What is the most enormous LEGO set ever, and how many pieces does it have? When were the first LEGO sets released, and how many sets did the company sell when it first opened its doors? Which LEGO theme is the most popular? By analysing the data, we can see when the company really took off based on its product offering. We can also answer questions like whether LEGO complexity has changed over time, or which sets tend to have more parts.
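
The relational merge at the heart of this project can be sketched like so. The two rows below are hypothetical fragments standing in for the real Rebrickable tables, which do use `set_num`, `theme_id`, `num_parts` and an `id` key on themes:

```python
import pandas as pd

# Hypothetical fragments of sets.csv and themes.csv
sets = pd.DataFrame({
    "set_num": ["10276-1", "60262-1"],
    "name": ["Colosseum", "Passenger Airplane"],
    "year": [2020, 2020],
    "theme_id": [673, 52],
    "num_parts": [9036, 669],
})
themes = pd.DataFrame({
    "id": [673, 52],
    "name": ["Creator Expert", "City"],
})

# Relational join: match sets.theme_id against themes.id; both tables
# have a "name" column, so suffixes keep them apart
merged = pd.merge(sets, themes, left_on="theme_id", right_on="id",
                  suffixes=("_set", "_theme"))

# Which set in this sample has the most pieces?
biggest = merged.loc[merged["num_parts"].idxmax()]
print(biggest["name_set"], biggest["num_parts"])
```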

04_Combine_Google_Trends_with_other_Data

This analysis shows how to combine Google Trends web-search data from https://trends.google.com/trends with other data sources. The popularity of search terms can tell us a lot about future trends. In this particular example, three main data sets were examined.

  1. Bitcoin search volume in comparison to Bitcoin prices
  2. The relationship between Tesla's stock price and Tesla search volume
  3. Unemployment Rate vs. Unemployment Benefits Search Volume

For the Bitcoin and Tesla stock prices, https://finance.yahoo.com/quote was used.
For the unemployment rate, https://fred.stlouisfed.org/series/UNRATE/ was used.
In order to match the data sets, resampling of the dates and time series was necessary.
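
The resampling step can be sketched as follows: the price data is daily while Google Trends data is monthly, so the daily series is averaged into month-start buckets to align the two. The prices below are made-up illustration values:

```python
import pandas as pd

# Hypothetical daily Bitcoin closing prices; Google Trends data is monthly,
# so we resample the daily series to month-start averages to align them
daily = pd.DataFrame(
    {"CLOSE": [100.0, 110.0, 120.0, 200.0, 220.0]},
    index=pd.to_datetime(["2020-01-10", "2020-01-20", "2020-01-30",
                          "2020-02-05", "2020-02-15"]),
)

monthly = daily.resample("MS").mean()  # "MS" = calendar month start
print(monthly)
```

After this step, both series share the same monthly index and can be plotted on a shared time axis.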

05_Analysing_Google_Apps_Plotly

This analysis focuses on data about Google apps scraped from the Google Play Store by Lavanya Gupta in 2018. The original files are available [here](https://www.kaggle.com/lava18/google-play-store-apps). The main goal was to depict interesting facts about Google apps with the help of the Python library Plotly, using diverse chart types, such as pie, bar, box, and many more.
The data about Google apps was explored to find out interesting facts like:

  • how competitive various app categories (for example, games, lifestyle, and weather) are,
  • what the most popular apps are,
  • what the most downloaded app's estimated revenue was,
  • how the monetization of an app affects its download count,
  • what a reasonable price for an app is.

06_NumPy_and_N_Dimensional_Arrays

In this project, we will explore NumPy (Numerical Python), a Python library used in almost every field of science and engineering. It is practically the standard for working with numerical data in Python. This project is an introduction that gives a better understanding of how to work with the library.
The main points include:

  • how to work with arrays
  • how to create ndarrays
  • creating arrays with standard functions such as arange(), random(), or linspace()
  • what exactly broadcasting is and how it works
  • how to do linear algebra with NumPy
  • image manipulation with NumPy arrays
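
The first few bullets can be condensed into a short sketch: array construction with arange() and linspace(), broadcasting a 1-D row across a 2-D array, and a small piece of linear algebra:

```python
import numpy as np

# Creating arrays with standard constructors
a = np.arange(6).reshape(2, 3)    # [[0, 1, 2], [3, 4, 5]]
b = np.linspace(0.0, 1.0, num=3)  # [0.0, 0.5, 1.0]

# Broadcasting: the 1-D row b is stretched across both rows of a
c = a + b
print(c)

# A little linear algebra: matrix product of a with its transpose
m = a @ a.T                       # shape (2, 2)
print(m)
```

Broadcasting works here because the trailing dimensions match (3 and 3); NumPy virtually repeats `b` along the missing axis instead of copying it.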

07_Seaborn_and_Linear_Regression

This project focuses on analysing data about films. The data was scraped on May 1, 2018 from https://www.the-numbers.com/movie/budgets . In this analysis, we will explore how the budget of a movie influences its revenue, and how to predict future revenue based on the budget and the year the movie was filmed. Different Python libraries were used: the visualisation library Seaborn, which is based on Matplotlib, for generating different kinds of charts (bubble charts, scatter charts, regressions), as well as the open-source data analysis library scikit-learn (the gold standard for machine learning), which was used to fit the linear regression, check how accurate our model is, and make predictions about future revenue.
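
The scikit-learn part can be sketched like this. The budget/revenue pairs are invented illustration values, not the scraped data:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical budgets (in $ millions) and revenues for a handful of films
X = np.array([[10.0], [50.0], [100.0], [150.0], [200.0]])
y = np.array([30.0, 120.0, 260.0, 380.0, 520.0])

# Fit a simple linear regression: revenue ≈ slope * budget + intercept
model = LinearRegression().fit(X, y)
print(model.coef_[0], model.intercept_)

# R^2 tells us how well the line explains the data (1.0 = perfect fit)
print(model.score(X, y))

# Predict revenue for a hypothetical $120M budget
pred = model.predict([[120.0]])
print(pred[0])
```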

08_Nobel_Prize_Analysis

In this project, we're going to analyse data (https://www.nobelprize.org/prizes/lists/all-nobel-prizes/) on the past winners of the Nobel Prize. Thanks to this analysis, we will find some interesting facts about the Nobel laureates, such as: the ratio of male to female winners, who was the first to win a Nobel Prize, how many people have won a Nobel Prize more than once, how many categories there are and how many prizes have been awarded in each category, the number of Nobel Prizes awarded over time, which countries have the most Nobel Prizes, which cities produce the most discoveries, where the Nobel laureates were born, and what patterns there are in the laureates' age at the time of the award.
Different Python libraries were used to better visualise the results: Seaborn, which is based on Matplotlib, for generating different kinds of charts (bubble charts, scatter charts, box charts, regressions); Plotly, a Python graphing library that makes interactive, publication-quality graphs (like sunbursts and choropleths); and Matplotlib, a comprehensive library for creating static, animated, and interactive visualisations in Python. Matplotlib makes easy things easy and hard things possible.
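
Questions like "how many people won more than once?" reduce to a `value_counts` over the laureate names. The rows below are a tiny hypothetical stand-in for the real dataset (the real winners listed are used only as familiar examples):

```python
import pandas as pd

# Hypothetical laureate records in the shape of the Nobel dataset
df = pd.DataFrame({
    "full_name": ["Marie Curie", "Marie Curie", "Albert Einstein",
                  "Linus Pauling", "Linus Pauling"],
    "category": ["Physics", "Chemistry", "Physics", "Chemistry", "Peace"],
    "year": [1903, 1911, 1921, 1954, 1962],
})

# Prizes per category
print(df["category"].value_counts())

# Laureates who won more than once
counts = df["full_name"].value_counts()
repeat_winners = counts[counts > 1].index.tolist()
print(repeat_winners)
```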

09_Hospital_Birth_Deaths_Analysis

In this project, we're going to analyse data (http://graphics8.nytimes.com/images/blogs/freakonomics/pdf/the%20etiology,%20concept%20and%20prophylaxis%20of%20childbed%20fever.pdf) that was collected by Dr. Semmelweis in the 1800s on the deaths of women in maternity wards from childbed fever. Some highlights from the analysis include calculating the percentage of women dying in childbirth and visualising the total number of births and deaths over time. We will also look more closely at the effect of handwashing, calculating the difference in the average monthly death rate before and after handwashing was introduced, and using histograms to visualise the monthly distribution of outcomes (percentage of deaths).
The main libraries used in this analysis were:
Seaborn, which is based on Matplotlib, for generating different kinds of charts.
Plotly, a Python graphing library that makes interactive, publication-quality graphs.
Matplotlib, a comprehensive library for creating static, animated, and interactive visualisations in Python.
SciPy, open-source software for mathematics, science, and engineering (used here to calculate the t-statistic and the p-value).
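
The SciPy step can be sketched as a two-sample t-test on the monthly death rates before and after handwashing. The rates below are made-up values in the spirit of the dataset:

```python
import numpy as np
from scipy import stats

# Hypothetical monthly death rates (proportion of mothers dying)
# before and after mandatory handwashing was introduced
before = np.array([0.12, 0.10, 0.15, 0.18, 0.11, 0.14])
after = np.array([0.02, 0.03, 0.01, 0.02, 0.04, 0.02])

# Two-sample t-test: is the drop in the mean death rate significant?
t_stat, p_value = stats.ttest_ind(before, after)
print(t_stat, p_value)

# Difference in the average monthly death rate
print(before.mean() - after.mean())
```

A small p-value (conventionally below 0.05) means the observed drop is unlikely to be due to chance alone.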

10_Predict_House_Prices

In this project, we will use data from the UCI ML housing dataset, https://archive.ics.uci.edu/ml/machine-learning-databases/housing/. This data set includes 14 characteristics describing the housing market in the Boston area in the 1970s. We will analyse the data and, based on that, build a multivariable regression model to predict house prices in this area. We will divide the data set into two parts: one used to train our multivariable linear regression, i.e. to find its parameters, and the other used for testing. The price will be the target value, and the remaining characteristics (13 in total, like CRIM (crime rate), RM (number of rooms), and NOX (pollution)) will be used as features to determine the price. The features will be analysed and checked to see whether they are sufficient for predicting house prices. The main libraries used in this analysis were:
Seaborn, which is based on Matplotlib, for generating different kinds of charts.
Plotly, a Python graphing library that makes interactive, publication-quality graphs.
Matplotlib, a comprehensive library for creating static, animated, and interactive visualisations in Python.
SciPy, open-source software for mathematics, science, and engineering (used here to calculate the t-statistic and the p-value).
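
The train/test split and multivariable fit can be sketched as follows. Instead of the real Boston data, this uses a synthetic stand-in with two features (a rooms-like and a crime-like variable) and a known price formula, so we can check that the regression recovers the coefficients:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the housing data: price depends linearly on
# the number of rooms (positive) and the crime rate (negative), plus noise
rng = np.random.default_rng(42)
rooms = rng.uniform(4, 9, size=100)
crime = rng.uniform(0, 10, size=100)
price = 8.0 * rooms - 1.5 * crime + rng.normal(0, 1, size=100)

# Hold out 20% of the rows for testing, train on the rest
X = np.column_stack([rooms, crime])
X_train, X_test, y_train, y_test = train_test_split(
    X, price, test_size=0.2, random_state=10)

model = LinearRegression().fit(X_train, y_train)
print(model.coef_)                  # should be close to [8.0, -1.5]
print(model.score(X_test, y_test))  # R^2 on the held-out test data
```

Evaluating on the held-out rows, rather than the training rows, is what tells us whether the model generalises instead of merely memorising.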
