## Structure of the lecture on modeling qualitative data with MTA (MQD)

In this lecture, we learn how to model qualitative data based on texts. The lecture covers the most important steps, beginning with the installation of the required software and ending with the publication of the results in several forms.

The lecture's structure develops along the following key steps:

  1. Lecture 1: theoretical background: what the modeling of qualitative data is, which software will be used in the lecture, and why
  2. Lecture 2: install the software (Anaconda; Linux on a USB key -- install and postinstall --, Python 3)
  3. Lecture 3 and 4: data-driven preprocessing -- gathering text data and basic conversion; parsing text data in several formats (pdf, doc·x/odt/rtf, x·htm·l, txt etc.) using low-level programming utilities
  4. Lecture 5: using MTA to model your data -- the topic modeling way, basic requirements and first analysis
  5. Lecture 6: automation of MTA, in-depth interpretation of the results generated by MTA (in-room lecture)
  6. Lecture 7: MTA plots, CSV files to generate your own static plots, usage of Jupyter as a collaborative platform for this purpose (in-room lecture)
  7. Lecture 8: Communicating your results -- simple widgets or how to give your results some interactivity
  8. Lecture 9: Dashboards and Graphs -- web-app, topic networks and semantic chains (standalone example)
  
The lecture is organized as a mix of several materials: Jupyter notebooks with code snippets that you can run within the notebook, video material covering the usage of some of the software used in this lecture, and exercises that you can complete to improve your own skills.

Do you have questions or suggestions, or do you want to know more about topic modeling with MTA and our work in this area? You are welcome to contact me at: christian dot papilloud at soziologie dot uni-halle dot de.

## Modeling Qualitative Data -- Overview

 1. Goal: learning a general workflow that you can easily implement to model qualitative data
 2. Tools:
     - Anaconda or a mainstream Linux operating system
     - a dataset consisting of various text-based data
     - low-level native Unix programs to preprocess your data
     - Python programs to do your analysis
     - Python programs to lay out your results
     - Python programs to make presentations of your results

## What are qualitative data and how to model them

1. Qualitative data: mostly text-based data of various types (not exclusively, f.ex. also pictures)
    - postcards, vignettes, short texts (f.ex. tweets)
    - newspaper articles
    - scientific articles
    - books
    - corpora = collections of text data, i.e. big (qualitative) data

2. Modeling techniques: techniques to understand the information content of your dataset

## Several modeling techniques -- Exclusively human-centric techniques

1. human-centric techniques:
   - based on reading the dataset and interpreting it
   - categorization of the information content in the dataset
   - structuration of the categories based on the interpretation of the dataset
   - result: analysis of this structure = output the hidden/latent structure of the dataset

     * advantages: 
       - based on the human understanding of texts = better control over the interpretation of texts
       - sensitive to the polysemic meaning of words/texts
     * disadvantages:
       - difficult to scale results to the total amount of the investigated material -- main results often apply to only 10-15% of the investigated material
       - difficult to generalize results beyond the given dataset
       - difficult to reproduce the results
       - difficult to share the results with other researchers
       - difficult to generalize the results to other sources/actors where the dataset comes from
       - possible interpretation bias
       - time consuming --> often limits the scope of data that can be investigated in a given time
       - complement quantitatively oriented research designs, but are not compatible with them = parallel routes

## Several modeling techniques -- Mostly human-centric techniques

2. mostly human-centric techniques, with the help of basic computing:
   - based on reading the dataset and interpreting it
   - categorization of the information content in the dataset -- computer driven
   - basic statistics (mostly frequencies of word occurrences and distribution of words)
   - structuration of the categories based on the interpretation of the dataset -- computer driven
   - basic structuration tools (f.ex. MAXQDA, NLP techniques)
   - result: analysis of this structure = output the hidden/latent structure of the dataset

     * advantages: 
       - based on the human understanding of texts, adds computer-driven facilities
       - sensitive to the polysemic meaning of words/texts
       - better at generalizing the results beyond the given dataset than exclusively human-centric techniques
       - less time consuming than exclusively human-centric techniques
     * disadvantages:
       - difficult to scale results to the total amount of the investigated material -- main results often apply to only 10-15% of the investigated material
       - difficult to reproduce the results
       - difficult to share the results with other researchers
       - difficult to generalize the results to other sources/actors where the dataset comes from
       - possible interpretation bias
       - complement quantitatively oriented research designs, but are not compatible with them = parallel routes
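
The "basic statistics" mentioned in this list (frequencies of word occurrences) can be illustrated with a few lines of Python. This is only a minimal sketch using the standard library; the function name `word_frequencies` and the sample sentence are made up for this example and are not part of MTA.

```python
import re
from collections import Counter


def word_frequencies(text):
    """Count how often each word occurs in a text (case-insensitive)."""
    # \w+ matches runs of letters, digits and underscores; punctuation is dropped
    words = re.findall(r"\w+", text.lower())
    return Counter(words)


# Hypothetical sample text, standing in for one document of your dataset
sample = "The cat sat on the mat. The mat was red."
freq = word_frequencies(sample)

# Show the three most frequent words together with their counts
print(freq.most_common(3))
```

Such counts only describe the surface of a text; interpreting them remains the task of the researcher, which is why these techniques are still mostly human-centric.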

## Several modeling techniques -- Partly human-centric techniques

3. partly human-centric techniques, partly computer driven:
   - reading the dataset and processing it is computer driven
   - categorization of the information content in the dataset -- computer driven
   - advanced analytics using statistical or mathematical modeling methods
   - structuration of the categories based on the modeling methods
   - advanced structuration tools (f.ex. R, Python)
   - result: analysis of this structure = output the hidden/latent structure of the dataset --> rests on human understanding of the results
   
     * advantages:
       - based on the human understanding of texts, adds advanced data analytics
       - better scaling of the results --> apply to the total amount of the investigated material
       - better at generalizing the results out of the given dataset than other human-centric techniques
       - less time consuming than other human-centric techniques
       - results can easily be reproduced
       - results can easily be shared with other researchers
       - better at generalizing the results to other sources/actors where the dataset comes from
       - better at accumulating further data to enrich the dataset
       - better at comparing the same kind of data in different languages
       - reduces the interpretation bias
       - better compatibility with quantitatively oriented research designs = converging routes
     * disadvantages:
       - less sensitive to the polysemic meaning of words/texts (even in AI frameworks)
       - knowledge demanding --> requires skills in programming (acquiring them can be time consuming)
       

## Right tools for the right tasks -- Operating systems

1. Why Linux?
  - open-source operating system -- easy to install and maintain at no financial cost
  - get the most out of dated hardware --> reuse your old computers
  - portable -- use the OS on a lot of hardware, as well as from simple external drives or USB keys
  - mainstream software for all mainstream tasks
  - powerful software for data analytics:
    - install R (from CRAN) and related packages
    - Python comes preinstalled with most mainstream distributions
    - benefit from native low-level Unix utilities to tailor the dataset
    - provides free open-source software to extend the analytics framework
    
2. Why not Windows or MacOS (or *BSD)?
    - cost of the operating system and the software
    - no portability of the software to other hardware -- you have to stick with one given hardware
    - Windows: no out-of-the-box tools to tailor the dataset --> limited choice of Unix tools compatible with Windows
    - MacOS and *BSD flavors: some out-of-the-box tools to tailor the dataset --> not always compatible with the same Unix tools on Linux -- *BSD operating systems are more demanding
    - but: you can install R (directly) and Python (with f.ex. Anaconda) for data analytics
    - but: you can install free open-source software to extend the analytics framework
    
Linux is more often used in the context of data analytics because of its practicality and scalability. Disadvantage: coming from Windows or MacOS, there is a learning curve regarding:
  - the use of the command line
  - the use of software equivalent to what you have on Windows/MacOS --> f.ex. LibreOffice instead of Microsoft Office
  - in general: changing some of your habits

Benefit: gain in autonomy for your research --> you can design and do your work and workflow as you want, i.e. you are not limited by the OS. However, if you want to work with your own non-Unix operating system, we provide some advice in these lectures on how to do so with minimal effort.

## Right tools for the right tasks -- Software

1. Why Python and not R?
  - both are excellent software and programming environments with a long history and a great community
  - both have a learning curve
  - R -- mainly used for statistics
  - Python -- more general approach to data science
  - R -- you use the flexibility of R libraries
  - Python -- you can write your application from scratch
  - R -- runs locally
  - Python -- better integration with apps and better deployment 

Python is often the first and obvious choice when it comes to designing machine learning frameworks, and it is easy to find support and material for your work.

## About these lectures

These lectures are provided in the form of a notebook that you can run and update with your own notes on your computer. 

To follow this lecture and run the code, we recommend using JupyterLab. You can easily install JupyterLab with your Python distribution and run it privately in a browser window. Using Anaconda, you can install JupyterLab from the Anaconda package manager, or in a (base root) terminal by typing:

```
pip install jupyterlab
```

On Linux, open a terminal and enter: 

```
pip3 install jupyterlab
```
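
If you are not sure whether the installation succeeded, the following sketch checks from within Python whether a package can be found by the interpreter you are currently using. The helper name `is_installed` is made up for this example; it only relies on the standard library module `importlib`.

```python
import importlib.util


def is_installed(package_name):
    """Return True if the current Python interpreter can find the package."""
    # find_spec returns None when no top-level package with this name exists
    return importlib.util.find_spec(package_name) is not None


# Prints True if JupyterLab is available in this Python environment
print("jupyterlab installed:", is_installed("jupyterlab"))
```

If this prints `False` although you installed JupyterLab, you are probably running a different Python environment than the one you installed it into.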

Some of the code snippets provided in this notebook are commented out, i.e. they are prefixed with the '#' sign, which tells the interpreter to ignore that line. You can uncomment those lines, i.e. remove the '#' sign, to see what the code does in practice. Do not remove the exclamation mark in front of a code snippet when you see one, because it tells Jupyter to pass the command to the system shell instead of running it as Python code.
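
As a minimal illustration of the difference between an active line and a commented line (the variable name `message` is made up for this example and does not appear in the lecture code):

```python
# The next line is active: Python runs it.
message = "this line runs"

# The next line is commented out: it is ignored until you remove the '#'.
#message = "this line was uncommented"

# In a notebook, a shell command starts with '!'. It is kept commented out
# here so that this cell also runs as plain Python outside of Jupyter.
#!pip3 list

print(message)
```

Removing the '#' in front of the second assignment changes the printed text; removing the '#' in front of `!pip3 list` inside a notebook lists the installed Python packages.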