
Udacity Data Analyst Nanodegree

My Udacity Data Analyst Nanodegree projects. In Python, unless otherwise stated.

Link to syllabus

Associated repositories: Data Analyst Nanodegree: Career Development

Featured project: Explore what I consider my best project to date!

Completion time: 5 days

Supporting lesson content: Introduction to Python Programming (Beginner)

Acquired familiarity with: Python

Link to project specification

Overview

Analyse data from Ford GoBike, a bike sharing company in the San Francisco Bay Area, using Python. Clean the dataset, create visualisations from the wrangled data, uncover and explore trends.

This is a rework of the original introductory project, to be completed within the first week of enrolment in the Data Analyst Nanodegree program. Using Pandas, Seaborn, and Basemap, I explore two years of daily trip data across five US cities (Mountain View, Palo Alto, Redwood City, San Francisco, and San Jose) and for two types of users (annual subscribers and customers).
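
A minimal sketch of this kind of wrangling and plotting, with hypothetical file and column names (the actual trip files use different schemas across years):

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# File and column names are hypothetical, for illustration only
trips = pd.read_csv('trip_data.csv')

# Basic cleaning: drop missing durations, convert seconds to minutes
trips = trips.dropna(subset=['duration'])
trips['minutes'] = trips['duration'] / 60

# Compare trip durations across the two user types
sns.boxplot(x='subscription_type', y='minutes',
            data=trips[trips['minutes'] < 60])
plt.title('Trip duration by user type')
plt.show()
```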

What was the biggest challenge?

Although I had just finished the introductory Python course, I found the provided code quite difficult to grasp. To read and write text files, for example, the assignment made extensive use of the csv module (which I would later review for P2 and P3) in place of the more intuitive Pandas. The dataset, however, was extremely interesting, so digging into it was definitely an enriching experience.
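
For comparison, a sketch of the two approaches (the file name is hypothetical):

```python
import csv
import pandas as pd

# Reading a delimited text file with the csv module, as in the assignment...
with open('trip_data.csv') as f:
    rows = list(csv.DictReader(f))  # each row is a dict of strings

# ...versus the Pandas one-liner, with automatic type inference
df = pd.read_csv('trip_data.csv')
```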

Which part of the code do you like best?

Revealing that a small group of annual subscribers, possibly unaware of overtime fees, keeps the bike for an entire working day; and discovering that Ford GoBike, in line with my findings, decided to drop the program in three of the five cities.

Completion time: 5 days

Supporting lesson content: Intro to Descriptive Statistics (Beginner)

Acquired familiarity with: NumPy, SciPy, Matplotlib

Overview

In this project, you will demonstrate your knowledge of descriptive statistics by conducting an experiment dealing with drawing from a deck of playing cards and creating a write-up containing your findings.

This is a practical application of the central limit theorem. I generate a deck of cards, draw randomly from it, analyse the distribution of outcomes, and compute basic statistics; where applicable, I also provide confidence intervals for the population mean when only sample moments are available, as well as the cumulative distribution function of a random variable X, F(x) = P(X ≤ x), the probability that it falls at or below a certain threshold value x.
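
A minimal sketch of the experiment, assuming illustrative card values 1 to 13 and arbitrary sample sizes:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
values = np.arange(1, 14)  # card values 1 (ace) to 13 (king), for illustration

# Draw many samples of n cards (with replacement) and record their sums: by
# the central limit theorem, the distribution of sums approaches a normal
n, trials = 3, 10_000
sums = rng.choice(values, size=(trials, n)).sum(axis=1)

# t-based 95% confidence interval for the population mean, using only the
# moments of a small sample
sample = sums[:30]
ci = stats.t.interval(0.95, df=len(sample) - 1,
                      loc=sample.mean(), scale=stats.sem(sample))

# Empirical estimate of F(x) = P(X <= x) for a threshold x
x = 20
print(ci, (sums <= x).mean())
```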

What was the biggest challenge?

Dealing with the replacement option, which affects the random card-drawing algorithm. I took advantage of Python's definition of set and list as collections of, respectively, unique and repeatable objects. For replacement == False, I initialised the hand to the empty set, so that no two identical cards could be drawn, and capped the number of cards to pick at 52. For replacement == True, I instead initialised the hand to the empty list, so that the same card could be drawn multiple times, and removed the 52-card constraint.
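
A minimal sketch of this logic (not the project's actual implementation):

```python
import random

def draw_hand(deck, n, replacement=False):
    """Draw n cards from deck, with or without replacement."""
    if not replacement:
        n = min(n, len(deck))   # at most 52 distinct cards can be drawn
        hand = set()            # a set rejects duplicate cards
        while len(hand) < n:
            hand.add(random.choice(deck))
        return hand
    # A list allows repeats, and the 52-card cap no longer applies
    return [random.choice(deck) for _ in range(n)]
```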

Which part of the code do you like best?

The deck generating process. In particular, the way suits and values are stored in different variables, then randomly paired by the card drawing algorithm. I believe this procedure is very intuitive and neat.
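
In essence, with illustrative variable names:

```python
import random

suits = ['clubs', 'diamonds', 'hearts', 'spades']
values = ['A', '2', '3', '4', '5', '6', '7', '8', '9', '10', 'J', 'Q', 'K']

# A card is just a random (value, suit) pair, so the full 52-card deck never
# needs to be materialised
card = (random.choice(values), random.choice(suits))
```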

Link to Python module

Completion time: 21 days

Supporting lesson content: Intro to Data Analysis (Beginner)

Textbook: Jake VanderPlas, Python Data Science Handbook: Essential Tools for Working with Data, O'Reilly Media, 2016

Acquired familiarity with: Pandas, Seaborn

Link to project specification

Overview

Choose one of Udacity's curated datasets and investigate it using NumPy and Pandas. Go through the entire data analysis process, starting by posing a question and finishing by sharing your findings.

Digging into the Titanic dataset, I go through all the steps involved in a typical data analysis process: formulate questions, wrangle (acquire and clean data), explore, draw conclusions, and communicate findings. I mainly use Pandas to store and handle data in tables, SciPy to detect statistical association among variables, and Seaborn to produce plots.

What was the biggest challenge?

Performing meaningful tests of association between binary variables and plotting the results with sufficient plot variety. Binary variables are challenging for two reasons: on one hand, they make the correlation coefficient difficult to interpret; on the other, very few plot types are suitable for visualising their relationship. To measure the degree of association between variables, I resorted to contingency tables, phi coefficients (binary-to-binary), and Cramér's V (nominal-to-binary). To display such associations, I experimented with various Seaborn plots: swarm, strip, violin, and joint, among others.
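
For reference, both statistics derive from the chi-squared statistic of a contingency table with n observations: phi = √(χ²/n) for a 2×2 table, and Cramér's V = √(χ²/(n·min(r−1, c−1))) for an r×c table.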

Which part of the code do you like best?

Function association, which returns phi coefficients (where applicable), Cramér's V, and the result of Pearson's chi-squared test of independence. For the first two statistics no built-in Python function was available, so I had to define my own.
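
A sketch of such a function, built on SciPy's chi2_contingency (not the project's actual code):

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

def association(x, y):
    """Phi (2x2 tables only), Cramér's V, and the p-value of Pearson's
    chi-squared test of independence for two categorical series."""
    table = pd.crosstab(x, y)
    chi2, p_value, dof, expected = chi2_contingency(table)
    n = table.to_numpy().sum()
    phi = np.sqrt(chi2 / n) if table.shape == (2, 2) else None
    cramers_v = np.sqrt(chi2 / (n * (min(table.shape) - 1)))
    return phi, cramers_v, p_value
```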

Link to Python module

Languages: Python, SQL

Completion time: 45 days

Supporting lesson content: Data Wrangling with MongoDB (Intermediate)

Acquired familiarity with: ElementTree XML API, SQLite

Link to project specification

Overview

Choose an area of the world you care about in OpenStreetMap.org and use data munging techniques to clean the related OSM file (XML). Import the file into a SQL or MongoDB database and run queries to inspect the cleaned data.

This project focuses on the wrangling step of the data analysis process: auditing (using regular expressions) and cleaning an XML document, writing its updated entries to CSV files, creating a SQL database and importing the files into it, querying the database and, finally, producing a report. This is where my Python skills definitely levelled up!
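
A minimal sketch of the load-and-query step with sqlite3, using hypothetical table and column names:

```python
import sqlite3

# Hypothetical schema, for illustration: the project's actual tables mirror
# the CSV files produced in the cleaning step
conn = sqlite3.connect('osm.db')
cur = conn.cursor()
cur.execute('CREATE TABLE IF NOT EXISTS nodes '
            '(id INTEGER PRIMARY KEY, lat REAL, lon REAL)')
cur.execute('INSERT OR REPLACE INTO nodes VALUES (?, ?, ?)',
            (1, 45.4642, 9.1900))
conn.commit()

# Example query: count the stored nodes
print(cur.execute('SELECT COUNT(*) FROM nodes').fetchone()[0])
conn.close()
```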

What was the biggest challenge?

Understanding the structure of the provided version of data.py was surprisingly hard, and I spent several days on this task. The most difficult part was grasping the syntax of xml.etree.cElementTree, the Python module used to parse the OSM document. Once this obstacle was overcome, however, the remaining steps of the process (polishing the XML elements, importing the data into SQL, exploring the database, and producing the PDF report) were thoroughly enjoyable!
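
A minimal example of the streaming parsing pattern (in Python 3, xml.etree.cElementTree is a deprecated alias of xml.etree.ElementTree):

```python
import xml.etree.ElementTree as ET

def count_tags(filename):
    """Tally the tags in an OSM (XML) file."""
    tags = {}
    # iterparse streams the document instead of loading it into memory at once
    for event, element in ET.iterparse(filename):
        tags[element.tag] = tags.get(element.tag, 0) + 1
        element.clear()  # free the element once counted
    return tags
```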

Which part of the code do you like best?

Map generation via the Basemap toolkit. My first experiment with the module, a scatter plot of the postal codes in the OSM file on top of a 2D map of Milan and its surroundings, left me ecstatic!
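
A minimal sketch of the idea, with illustrative coordinates and hypothetical data points:

```python
import matplotlib.pyplot as plt
from mpl_toolkits.basemap import Basemap

# Rough bounding box around Milan (illustrative coordinates)
m = Basemap(projection='merc', llcrnrlat=45.3, urcrnrlat=45.6,
            llcrnrlon=9.0, urcrnrlon=9.4, resolution='l')
m.drawmapboundary(fill_color='lightblue')

# Hypothetical postal-code coordinates extracted from the OSM file
lons, lats = [9.19, 9.23], [45.46, 45.48]
x, y = m(lons, lats)  # project longitude/latitude to map coordinates
m.scatter(x, y, marker='o', color='red', zorder=5)
plt.show()
```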

Link to Python modules

Languages: Python, R

Completion time:

Supporting lesson content: Data Analysis with R (Intermediate)

Acquired familiarity with: Requests, Beautiful Soup, R

Link to project specification

Overview

Choose one of Udacity's curated datasets or find one of your own and perform a complete exploratory data analysis using R: investigate relationships ranging from a single variable to multiple variables, together with distributions, outliers, and anomalies.

This project focuses on the exploration step of the data analysis process: revealing meaningful links among variables through assorted and compelling visualisations.

What was the biggest challenge?

Choosing the right dataset (one rich enough to be explored along many dimensions) and becoming familiar with the R language proved to be two rather challenging tasks. For the first, I decided to (ethically) scrape my own data using the Python libraries Requests and Beautiful Soup. For the second, although the R syntax was not immediately intuitive, I caught up quickly thanks to my experience with other programming languages.
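
A minimal sketch of the scraping pattern (the URL and tags are hypothetical):

```python
import requests
from bs4 import BeautifulSoup

# Hypothetical URL: always check a site's robots.txt and terms of service
# before scraping, and throttle your requests
url = 'https://example.com/movies'
response = requests.get(url, headers={'User-Agent': 'polite-scraper'})
response.raise_for_status()

soup = BeautifulSoup(response.text, 'html.parser')
titles = [tag.get_text(strip=True) for tag in soup.find_all('h2')]
```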

Which part of the code do you like best?

The map of the principal two-country co-production relationships. The plot, modelled after this one, took me three days to complete (as it involved a complex data tidying procedure), but the result was extremely pleasing.

Link to Python modules

Link to R module
