![Callysto.ca Banner](https://github.com/callysto/curriculum-notebooks/blob/master/callysto-notebook-banner-top.jpg?raw=true)

# Module 2 Unit 3 - Data Sources

### Data Sources

As we previously mentioned, many organizations such as post-secondary or private research institutions, scientific organizations, and governmental agencies make data they have collected available for public use.

Data sets published under an open license can be used for all kinds of things, including as teaching tools or to conduct research unrelated to their original purpose.

### 📚 Read
>[Reuser's Guide to Open Data Licensing](https://theodi.org/article/reusers-guide-to-open-data-licensing/)

One way to find data sets is through a simple internet search. Google has created a specialized search tool for this very purpose, the [Google Dataset Search](https://datasetsearch.research.google.com/).

Let's take a closer look at some different types of data sources.

### Private data sources

Companies that provide online services, such as Alphabet (Google), Amazon, Apple, Twitter, and Meta, collect vast amounts of data about users and their online activities.

These companies use the data to improve their services and develop new products, but they also sell access to much of this data to other organizations for advertising, research, and marketing purposes.

Data from private companies is often not freely available for educational purposes, but there are exceptions. The social media platform Twitter allows anyone to search and download posts made in the last week, and provides a variety of filters to help tailor the results.

That said, people pulling this data can access only a very limited subset and must use an application programming interface (API). Even so, this free access is noteworthy and is used extensively by social science researchers, such as those at the Social Media Lab, a research laboratory at Ryerson University.

For people interested in using social media for research, a variety of free tools are available to help access and download data from Twitter and other platforms.

### 📚 Read
>[Social media data in research: a review of the current landscape.](https://ocean.sagepub.com/blog/social-media-data-in-research-a-review-of-the-current-landscape) This short 2019 article by Lily Davies, a Digital Humanities master's student at UCL, summarizes some of the tools used to scrape data from social media platforms.


### Government data sources

Governments are increasingly making an effort to provide public access to data they have collected. 

Data that is freely available to be used, shared, and built on is referred to as **open data.** In many cases, this data is also structured to be machine readable and is accompanied by documentation about the format, along with metadata describing how the data was collected and how it is intended to be used.
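To illustrate what "machine readable" means in practice, here is a minimal sketch in Python. It assumes the data arrives as comma-separated values (CSV), one common format on open data portals. The sample text, column names, and numbers are all made up for illustration and do not come from any real portal.

```python
import csv
import io

# Hypothetical sample standing in for a file downloaded from an open data portal.
# The column names and values are made up for illustration only.
sample_csv = """city,year,trees_planted
Springfield,2020,1200
Springfield,2021,1500
Shelbyville,2020,800
Shelbyville,2021,950
"""

# Because the data is machine readable, each row can be parsed into a dictionary
# keyed by the column headers, with no manual retyping.
rows = list(csv.DictReader(io.StringIO(sample_csv)))

# Add up the (made-up) number of trees planted in each city.
totals = {}
for row in rows:
    totals[row["city"]] = totals.get(row["city"], 0) + int(row["trees_planted"])

print(totals)
```

With a real data set, the same approach would likely start from a downloaded file rather than an inline string, and the accompanying documentation would explain what each column means.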

The Government of Canada, Statistics Canada, the provincial and territorial governments, and even many municipalities have open data portals where anyone can find data sets created as part of government projects.

**Explore**
>[Open Government Programs in Canada](https://open.canada.ca/en/maps/open-data-canada#toc1) is an interactive map of the various open data portals around the country.

**Explore**
>[Major Smart Cities with Open Data](https://rlist.io/l/major-smart-cities-with-open-data-portals) is a list of cities around the world with open data portals. 

### Academic data sources

Post-secondary institutions generate a lot of valuable research data and, thanks to the UNESCO Recommendation on Open Science, are increasingly making their data sets available to the public in formats that allow them to be explored, shared, and expanded upon. These efforts include:

* [OpenDOAR](http://v2.sherpa.ac.uk/opendoar/), a global directory of open access repositories.
* [Re3Data](https://www.re3data.org/), an online registry of research data repositories.
* [Figshare](https://figshare.com/), a repository where users can make all of their research outputs available in a citable, shareable, and discoverable manner.
* [Dryad](https://datadryad.org/stash), a community-owned and curated research data resource.

### Non-profit data sources

Rich data sets are also made available by other sources, including non-profit organizations. Some of the non-profit organizations sharing open data sets include:

* [Gapminder](https://www.gapminder.org/)
* [Billion Prices Project](http://www.thebillionpricesproject.com/)
* [Pew Research Center](https://www.pewresearch.org/download-datasets/)
* [The World Bank](https://data.worldbank.org/)
* [The United Nations](https://data.un.org/)
* [UNESCO Open Data](https://opendata.unesco.org/)
* [UNESCO Core Data Portal](https://core.unesco.org/)
* [The World Health Organization Global Health Observatory](https://www.who.int/data/gho)

Overall, the challenging part is often finding the relevant data source, which is what makes data set aggregators like the Google Dataset Search so valuable.

### Generating our own data sets

As previously mentioned, reusing existing data sets can often be faster and easier than creating our own. However, some methods of data collection are reasonable for use in a classroom setting.

Web scraping involves using automated tools to gather information from webpages and convert it into a format that is convenient for data analysis. 

For example, we could use this method to gather data related to NHL hockey teams and individual player performance records.

![hockey](../_images/Module2-Unit3-image4.jpeg)

Scraping live website data can be technically challenging, so we won't be exploring these methods in this course. However, for those teachers and students who are interested in learning how to do this on their own, the Codecademy article linked below provides more information.

### 📚 Read (Optional)
>[Web Scraping MLB Stats with Python and Beautiful Soup](https://news.codecademy.com/web-scraping-python-beautiful-soup-mlb-stats/)
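For the curious, the core idea of web scraping described above (turning webpage content into a format convenient for analysis) can be sketched with Python's built-in `html.parser` module. This is only a simplified illustration, not the approach used in the linked article: the HTML snippet, team names, and numbers are made up, and the `TableParser` class is a hypothetical helper written for this example. A real scraper would also need to download the page and cope with much messier markup.

```python
from html.parser import HTMLParser

# Hypothetical HTML snippet standing in for a downloaded web page.
# The team names and numbers are made up for illustration only.
sample_html = """
<table>
  <tr><th>Team</th><th>Wins</th><th>Losses</th></tr>
  <tr><td>Team A</td><td>30</td><td>12</td></tr>
  <tr><td>Team B</td><td>25</td><td>17</td></tr>
</table>
"""


class TableParser(HTMLParser):
    """Collect the text of every table cell, grouped by row."""

    def __init__(self):
        super().__init__()
        self.rows = []        # one list of cell texts per table row
        self.in_cell = False  # True while inside a <th> or <td>

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])      # start a new row
        elif tag in ("th", "td"):
            self.in_cell = True
            self.rows[-1].append("")  # start a new, empty cell

    def handle_endtag(self, tag):
        if tag in ("th", "td"):
            self.in_cell = False

    def handle_data(self, data):
        # Only keep text that appears inside a table cell.
        if self.in_cell:
            self.rows[-1][-1] += data.strip()


parser = TableParser()
parser.feed(sample_html)

# Treat the first row as column headers and the rest as records.
header, *records = parser.rows
teams = [dict(zip(header, record)) for record in records]
print(teams)
```

Each table row ends up as a dictionary keyed by column name, which is a convenient starting point for further analysis. Note that the scraped values are still text, so numbers such as wins and losses would need to be converted before doing any arithmetic.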



### 🏁 Activity

* What’s your favourite hobby? Can you find a data set associated with it (preferably an open data set)?

*Hint: Try [Google Dataset Search](https://datasetsearch.research.google.com/)*

* OR, based on your location, can you find the nearest [government](https://open.canada.ca/en/maps/open-data-canada#toc1) open data portal that’s relevant to you? Within that data repository, can you find a data set that interests you?


### Conclusion

In this unit, we learned about some of the resources teachers and students can use to access data and different types of use licenses.

Open data portals let us explore real data that is relevant to our lives and is more interesting to explore than outdated or made-up examples.

However, there is so much out there that it can be hard to choose a data set for use in the classroom.

In the next unit, we'll dive deeper into what makes a data set good for classroom analysis and data science in general.

[![Callysto.ca License](https://github.com/callysto/curriculum-notebooks/blob/master/callysto-notebook-banner-bottom.jpg?raw=true)](https://github.com/callysto/curriculum-notebooks/blob/master/LICENSE.md)