Over the past century, women in many countries have gained more rights, and societies have been moving towards greater equality. Our goal in this project is to highlight how social gender inequalities have evolved across different domains over time. We focus on the achievements of women in different fields, and on the recognition their work has received, as far back as the data goes. We will use data from Wikipedia to gather the number of women referenced, their contributions to their domains, and other parameters, and compare these with the same data for men. We will also try to determine whether the country of origin and the date at which women obtained these rights have a significant impact.
- Can the Wikipedia database be used to accurately show gender inequalities through time?
- Is there any evidence that equality between men and women has been reached?
- In which domains is there more or less equality? Does this change with region, country, or language?
- Since some countries granted women's rights later than others, is the evolution similar in terms of timeframe and extent?
Wikipedia data: Wikimedia dumps (https://dumps.wikimedia.org/). The data dump is in XML, which can be parsed using existing tools. It can also be imported into SQL for easier querying. Some existing projects on GitHub use the same dataset and could guide our data analysis. We will filter the pages to extract a "list" of influential people, divided into categories corresponding to what they are famous for. We can then split the data by gender, nationality, and the period in which they lived to draw further conclusions.
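As a rough illustration of the parsing step, the sketch below streams pages out of a MediaWiki XML export with Python's standard library instead of loading the whole file. It is only a sketch under assumptions: the helper names `local_name` and `iter_pages` are hypothetical, and the two-page `SAMPLE` is hand-written to stand in for a real dump file.

```python
import io
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the XML namespace, e.g. '{ns}page' -> 'page'."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(source):
    """Yield (title, wikitext) pairs from a MediaWiki XML export stream."""
    title, text = None, None
    for _event, elem in ET.iterparse(source, events=("end",)):
        name = local_name(elem.tag)
        if name == "title":
            title = elem.text
        elif name == "text":
            text = elem.text or ""
        elif name == "page":
            yield title, text
            title, text = None, None
            elem.clear()  # release the children of the finished page


# Hand-written sample standing in for a real dump (illustrative only).
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page><title>Marie Curie</title><revision><text>[[Category:Women physicists]]</text></revision></page>
  <page><title>Physics</title><revision><text>Physics is a natural science.</text></revision></page>
</mediawiki>"""

pages = list(iter_pages(io.BytesIO(SAMPLE)))
for title, text in pages:
    print(title, "->", len(text), "characters of wikitext")
```

For a compressed dump, the same generator could be fed a file object opened with `bz2.open(path, "rb")`; the category links in the wikitext are one possible basis for the filtering described above.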
11.11
- Download the required data
- Understand how to use the cluster to manipulate our data
- Understand the structure of the data from Wikipedia
13.11
- Filter the data to keep only what is useful to us
- Clean the data and convert it to an easily usable format
- Define what parameters we will use to quantify gender equality
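The parameters themselves are still to be defined. One simple candidate, shown here purely as an assumption and not as the chosen metric, is the share of women among the people listed in each group (domain, country, or period). The function name `female_share`, the record fields, and the toy records are all made up for illustration.

```python
from collections import Counter


def female_share(records, key):
    """Return {group: fraction of women} for records grouped by `key`."""
    totals, women = Counter(), Counter()
    for rec in records:
        group = rec[key]
        totals[group] += 1
        if rec["gender"] == "female":
            women[group] += 1
    return {group: women[group] / totals[group] for group in totals}


# Made-up toy records, not real data.
records = [
    {"gender": "female", "domain": "science"},
    {"gender": "male", "domain": "science"},
    {"gender": "male", "domain": "science"},
    {"gender": "male", "domain": "science"},
    {"gender": "female", "domain": "literature"},
    {"gender": "male", "domain": "literature"},
]

shares = female_share(records, "domain")
print(shares)
```

The same function can be reused with `key` set to a country or century field, which keeps the comparison across domains, regions, and periods consistent.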
20.11
- Analyze the data collected
- Think about the best visualisation for the data
23.11
- Clean up the code and proofread the report
01.12
- Extract all human entities from Wikidata with a JSON script
- Clean this data so the analysis can be repeated
- Start redoing the analysis in more depth with the new data
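The extraction step above could look roughly like the following sketch. It assumes the Wikidata JSON dump layout (one entity per line inside a JSON array) and relies on the Wikidata identifiers P31 ("instance of"), Q5 ("human"), and P21 ("sex or gender"). The helper names and the two sample lines are hypothetical, and the entity IDs in the sample are placeholders.

```python
import json

HUMAN = "Q5"          # Wikidata item "human"
INSTANCE_OF = "P31"   # Wikidata property "instance of"
GENDER = "P21"        # Wikidata property "sex or gender"


def claim_ids(entity, prop):
    """Return the item IDs an entity holds for the given property."""
    ids = []
    for statement in entity.get("claims", {}).get(prop, []):
        datavalue = statement.get("mainsnak", {}).get("datavalue")
        if datavalue and isinstance(datavalue.get("value"), dict):
            ids.append(datavalue["value"].get("id"))
    return ids


def iter_humans(lines):
    """Yield (entity_id, gender_ids) for every human entity in the dump lines."""
    for line in lines:
        line = line.strip().rstrip(",")
        if line in ("", "[", "]"):
            continue  # skip the array brackets that wrap the dump
        entity = json.loads(line)
        if HUMAN in claim_ids(entity, INSTANCE_OF):
            yield entity["id"], claim_ids(entity, GENDER)


def item(qid):
    """Build a minimal statement pointing at item `qid` (sample data helper)."""
    return {"mainsnak": {"datavalue": {"value": {"id": qid}}}}


# Placeholder entities: one human (gender Q6581072, "female") and one non-human.
sample_lines = [
    "[",
    json.dumps({"id": "Q-sample-1", "claims": {"P31": [item("Q5")], "P21": [item("Q6581072")]}}) + ",",
    json.dumps({"id": "Q-sample-2", "claims": {"P31": [item("Q515")]}}),
    "]",
]

humans = list(iter_humans(sample_lines))
print(humans)
```

Reading line by line keeps memory use flat, which matters because the full dump is far too large to load with a single `json.load` call.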
08.12
- Finish analysis and clean the code
- Start the report
16.12
- Finish report
Florian: First part, showing differences across cultures. Emile: Parsing the JSON and analysis across fields of work. We will both work on the final poster before ML4.